MergeTree table engine

The MergeTree engine and other engines of the MergeTree family (e.g. ReplacingMergeTree, AggregatingMergeTree) are the most commonly used and most robust table engines in ClickHouse.

MergeTree-family table engines are designed for high data ingest rates and huge data volumes. Insert operations create table parts which are merged by a background process with other table parts.

Main features of MergeTree-family table engines:

The table's primary key determines the sort order within each table part (clustered index). The primary key does not reference individual rows but blocks of 8192 rows called granules. This keeps the primary keys of huge data sets small enough to remain loaded in main memory, while still providing fast access to on-disk data.
Tables can be partitioned using an arbitrary partition expression. Partition pruning ensures partitions are omitted from reading when the query allows it.
Data can be replicated across multiple cluster nodes for high availability, failover, and zero downtime upgrades. See Data replication.
MergeTree table engines support various statistics kinds and sampling methods to help query optimization.

Note

Despite a similar name, the Merge engine is different from *MergeTree engines.

Creating tables

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [[NOT] NULL] [DEFAULT|MATERIALIZED|ALIAS|EPHEMERAL expr1] [COMMENT ...] [CODEC(codec1)] [STATISTICS(stat1)] [TTL expr1] [PRIMARY KEY] [SETTINGS (name = value, ...)],
    name2 [type2] [[NOT] NULL] [DEFAULT|MATERIALIZED|ALIAS|EPHEMERAL expr2] [COMMENT ...] [CODEC(codec2)] [STATISTICS(stat2)] [TTL expr2] [PRIMARY KEY] [SETTINGS (name = value, ...)],
    ...
    INDEX index_name1 expr1 TYPE type1(...) [GRANULARITY value1],
    INDEX index_name2 expr2 TYPE type2(...) [GRANULARITY value2],
    ...
    PROJECTION projection_name_1 (SELECT <COLUMN LIST EXPR> [GROUP BY] [ORDER BY]),
    PROJECTION projection_name_2 (SELECT <COLUMN LIST EXPR> [GROUP BY] [ORDER BY])
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr
    [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx' [, ...] ]
    [WHERE conditions]
    [GROUP BY key_expr [SET v1 = aggr_func(v1) [, v2 = aggr_func(v2) ...]] ] ]
[SETTINGS name = value, ...]

For a detailed description of the parameters, see the CREATE TABLE statement.

Query clauses

ENGINE

ENGINE — Name and parameters of the engine. ENGINE = MergeTree(). The MergeTree engine has no parameters.

ORDER BY

ORDER BY — The sorting key.

A tuple of column names or arbitrary expressions. Example: ORDER BY (CounterID + 1, EventDate).

If no primary key is defined (i.e. PRIMARY KEY was not specified), ClickHouse uses the sorting key as primary key.

If no sorting is required, you can use the syntax ORDER BY tuple(). Alternatively, if the setting create_table_empty_primary_key_by_default is enabled, ORDER BY () is implicitly added to CREATE TABLE statements. See Selecting a Primary Key.

PARTITION BY

PARTITION BY — The partitioning key. Optional. In most cases, you don't need a partition key, and if you do need to partition, generally you do not need a partition key more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). You should never use too granular partitioning. Don't partition your data by client identifiers or names (instead, make client identifier or name the first column in the ORDER BY expression).

For partitioning by month, use the toYYYYMM(date_column) expression, where date_column is a column with a date of the type Date. The partition names here have the "YYYYMM" format.
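To make the monthly scheme concrete, the following Python sketch models how a toYYYYMM-style key maps dates to partition values and which partitions a date-range filter would have to touch. It is an illustration only, not ClickHouse code; the helper names to_yyyymm and partitions_for_range are made up for this example.

```python
from datetime import date

def to_yyyymm(d: date) -> int:
    """Mimic the numeric value produced by toYYYYMM(date_column)."""
    return d.year * 100 + d.month

def partitions_for_range(first: date, last: date) -> list[int]:
    """List every monthly partition that can hold rows dated in [first, last]."""
    result = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        result.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result

print(to_yyyymm(date(2014, 1, 31)))                                 # 201401
print(partitions_for_range(date(2013, 11, 15), date(2014, 2, 1)))   # [201311, 201312, 201401, 201402]
```

Every partition outside the returned list can be pruned without being read.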

PRIMARY KEY

PRIMARY KEY — The primary key if it differs from the sorting key. Optional.

Specifying a sorting key (using ORDER BY clause) implicitly specifies a primary key. It is usually not necessary to specify the primary key in addition to the sorting key.

SAMPLE BY

SAMPLE BY — A sampling expression. Optional.

If specified, it must be contained in the primary key. The sampling expression must result in an unsigned integer.

Example: SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID)).

TTL

TTL — A list of rules that specify the storage duration of rows and the logic of automatic parts movement between disks and volumes. Optional.

The expression must evaluate to a Date, Date32, DateTime, or DateTime64 value, e.g. TTL date + INTERVAL 1 DAY.

The rule type DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'|GROUP BY specifies the action taken once the expression is satisfied (its value reaches the current time): removing expired rows, moving a part to the specified disk (TO DISK 'xxx') or volume (TO VOLUME 'xxx') if the expression is satisfied for all rows in the part, or aggregating values in expired rows. The default rule type is removal (DELETE). Multiple rules can be specified, but no more than one of them may be a DELETE rule.

For more details, see TTL for columns and tables.

SETTINGS

See MergeTree Settings.

Example of setting the clauses

ENGINE MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity=8192

In the example, we set partitioning by month.

We also set an expression for sampling as a hash by the user ID. This allows you to pseudorandomize the data in the table for each CounterID and EventDate. If you define a SAMPLE clause when selecting the data, ClickHouse will return an evenly pseudorandom data sample for a subset of users.

The index_granularity setting can be omitted because 8192 is the default value.

Deprecated Method for Creating a Table

Note

Do not use this method in new projects. If possible, switch old projects to the method described above.

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE [=] MergeTree(date-column [, sampling_expression], (primary, key), index_granularity)

MergeTree() Parameters

date-column — The name of a column of the Date type. ClickHouse automatically creates partitions by month based on this column. The partition names are in the "YYYYMM" format.
sampling_expression — An expression for sampling.
(primary, key) — Primary key. Type: Tuple()
index_granularity — The granularity of an index. The number of data rows between the "marks" of an index. The value 8192 is appropriate for most tasks.

Example

MergeTree(EventDate, intHash32(UserID), (CounterID, EventDate, intHash32(UserID)), 8192)

The MergeTree engine is configured in the same way as in the example above for the main engine configuration method.

Data storage

A table consists of data parts sorted by primary key.

When data is inserted in a table, separate data parts are created and each of them is lexicographically sorted by primary key. For example, if the primary key is (CounterID, Date), the data in the part is sorted by CounterID, and within each CounterID, it is ordered by Date.

Data belonging to different partitions are separated into different parts. In the background, ClickHouse merges data parts for more efficient storage. Parts belonging to different partitions are not merged. The merge mechanism does not guarantee that all rows with the same primary key will be in the same data part.

Data parts can be stored in Wide or Compact format. In Wide format each column is stored in a separate file in a filesystem, in Compact format all columns are stored in one file. Compact format can be used to increase performance of small and frequent inserts.

The data storage format is controlled by the min_bytes_for_wide_part and min_rows_for_wide_part settings of the table engine. If the number of bytes or rows in a data part is less than the corresponding setting's value, the part is stored in Compact format. Otherwise it is stored in Wide format. If none of these settings is set, data parts are stored in Wide format.
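The rule above can be written as a small decision function. This is a Python sketch of the rule as stated in this section, not the server's implementation; the function name part_format is hypothetical, the threshold values in the usage lines are arbitrary, and a threshold of 0 is used here to stand for "not set".

```python
def part_format(part_bytes: int, part_rows: int,
                min_bytes_for_wide_part: int = 0,
                min_rows_for_wide_part: int = 0) -> str:
    """Return 'Compact' or 'Wide' for a data part of the given size."""
    too_few_bytes = part_bytes < min_bytes_for_wide_part
    too_few_rows = part_rows < min_rows_for_wide_part
    # Falling below either threshold is enough to choose Compact.
    return "Compact" if (too_few_bytes or too_few_rows) else "Wide"

print(part_format(1_000, 10, min_bytes_for_wide_part=10_000_000))              # Compact
print(part_format(50_000_000, 1_000_000, min_bytes_for_wide_part=10_000_000))  # Wide
print(part_format(1_000, 10))                                                  # Wide (no thresholds set)
```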

Each data part is logically divided into granules. A granule is the smallest indivisible data set that ClickHouse reads when selecting data. ClickHouse does not split rows or values, so each granule always contains an integer number of rows. The first row of a granule is marked with the value of the primary key for the row. For each data part, ClickHouse creates an index file that stores the marks. For each column, whether it's in the primary key or not, ClickHouse also stores the same marks. These marks let you find data directly in column files.

The granule size is restricted by the index_granularity and index_granularity_bytes settings of the table engine. The number of rows in a granule lies in the [1, index_granularity] range, depending on the size of the rows. The size of a granule can exceed index_granularity_bytes if the size of a single row is greater than the value of the setting. In this case, the size of the granule equals the size of the row.
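As a rough illustration of how the two settings interact, the Python sketch below computes the number of rows per granule under the simplifying assumption that every row has the same size in bytes (real rows vary, so real granules vary too). The function name rows_per_granule and the sample values are chosen for this example only.

```python
def rows_per_granule(index_granularity: int,
                     index_granularity_bytes: int,
                     row_size_bytes: int) -> int:
    """Rows in one granule, assuming all rows are row_size_bytes long."""
    rows_that_fit = index_granularity_bytes // row_size_bytes
    # Never more than index_granularity rows, never fewer than one row.
    return max(1, min(index_granularity, rows_that_fit))

TEN_MIB = 10 * 1024 * 1024
print(rows_per_granule(8192, TEN_MIB, 100))         # 8192: the row limit applies
print(rows_per_granule(8192, TEN_MIB, 4096))        # 2560: the byte limit applies
print(rows_per_granule(8192, TEN_MIB, 20_000_000))  # 1: a single row exceeds the byte limit
```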

Primary Keys and Indexes in Queries

Take the (CounterID, Date) primary key as an example. In this case, the sorting and index can be illustrated as follows:

Whole data:     [---------------------------------------------]
CounterID:      [aaaaaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
Date:           [1111111222222233331233211111222222333211111112122222223111112223311122333]
Marks:           |      |      |      |      |      |      |      |      |      |      |
                a,1    a,2    a,3    b,3    e,2    e,3    g,1    h,2    i,1    i,3    l,3
Marks numbers:   0      1      2      3      4      5      6      7      8      9      10

If the data query specifies:

CounterID IN ('a', 'h'), the server reads the data in the ranges of marks [0, 3) and [6, 8).
CounterID IN ('a', 'h') AND Date = 3, the server reads the data in the ranges of marks [1, 3) and [7, 8).
Date = 3, the server reads the data in the range of marks [1, 10].

The examples above show that it is always more effective to use an index than a full scan.
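The three mark ranges listed above can be reproduced with a short Python model that looks only at the mark values, which is all a sparse index knows. This is a simplified sketch for this one illustration, not ClickHouse's key-condition analysis; the names MARKS, granule_may_match, and mark_ranges are made up here.

```python
# (CounterID, Date) of the first row of each granule, as in the illustration.
MARKS = [("a", 1), ("a", 2), ("a", 3), ("b", 3), ("e", 2), ("e", 3),
         ("g", 1), ("h", 2), ("i", 1), ("i", 3), ("l", 3)]

def granule_may_match(i, counter_ids=None, date=None):
    """Can granule i contain a matching row? None means 'no condition'."""
    lo = MARKS[i]
    hi = MARKS[i + 1] if i + 1 < len(MARKS) else None  # last granule is open-ended
    if counter_ids is None:
        # Date is ordered only while CounterID stays the same, so it can
        # be bounded only when both neighbouring marks share a CounterID.
        if date is None or hi is None or lo[0] != hi[0]:
            return True
        return lo[1] <= date <= hi[1]
    for cid in counter_ids:
        if date is None:
            if lo[0] <= cid and (hi is None or cid <= hi[0]):
                return True
        else:
            key = (cid, date)  # tuples compare lexicographically
            if lo <= key and (hi is None or key <= hi):
                return True
    return False

def mark_ranges(counter_ids=None, date=None):
    """Merge matching granules into half-open mark ranges [begin, end)."""
    ranges = []
    for i in range(len(MARKS)):
        if granule_may_match(i, counter_ids, date):
            if ranges and ranges[-1][1] == i:
                ranges[-1] = (ranges[-1][0], i + 1)
            else:
                ranges.append((i, i + 1))
    return ranges

print(mark_ranges(counter_ids={"a", "h"}))          # [(0, 3), (6, 8)]
print(mark_ranges(counter_ids={"a", "h"}, date=3))  # [(1, 3), (7, 8)]
print(mark_ranges(date=3))                          # [(1, 11)], i.e. marks 1 through 10
```

The last case shows why a condition on Date alone is much less selective: whenever CounterID changes between two marks, the granule cannot be excluded.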

Because the index is sparse, some extra data may be read. When reading a single range of the primary key, up to index_granularity * 2 extra rows in each data block can be read.

Sparse indexes allow you to work with a very large number of table rows, because in most cases, such indexes fit in the computer's RAM.

ClickHouse does not require a unique primary key. You can insert multiple rows with the same primary key.

You can use Nullable-typed expressions in the PRIMARY KEY and ORDER BY clauses but it is strongly discouraged. To allow this feature, turn on the allow_nullable_key setting. The NULLS_LAST principle applies for NULL values in the ORDER BY clause.

Selecting a primary key

The number of columns in the primary key is not explicitly limited. Depending on the data structure, you can include more or fewer columns in the primary key. This may:

- Improve the performance of an index.
  If the primary key is (a, b), then adding another column c will improve the performance if the following conditions are met:
  - There are queries with a condition on column c.
  - Long data ranges (several times longer than the index_granularity) with identical values for (a, b) are common. In other words, when adding another column allows you to skip quite long data ranges.
- Improve data compression.
  ClickHouse sorts data by primary key, so the higher the consistency, the better the compression.
- Provide additional logic when merging data parts in the CollapsingMergeTree and SummingMergeTree engines.
  In this case it makes sense to specify the sorting key that is different from the primary key.

A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during SELECT queries.

You can create a table without a primary key using the ORDER BY tuple() syntax. In this case, ClickHouse stores data in insertion order. If you want to preserve the data order when inserting data with INSERT ... SELECT queries, set max_insert_threads = 1.

To select data in the initial order, use single-threaded SELECT queries.

Choosing a primary key that differs from the sorting key

It is possible to specify a primary key (an expression with values that are written in the index file for each mark) that is different from the sorting key (an expression for sorting the rows in data parts). In this case the primary key expression tuple must be a prefix of the sorting key expression tuple.
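The prefix requirement can be illustrated with a tiny Python check. It compares expressions as plain strings, which is an assumption made for brevity (the server works on parsed expressions); the function name is hypothetical.

```python
def is_valid_primary_key(primary_key: tuple, sorting_key: tuple) -> bool:
    """True if primary_key is a prefix of sorting_key."""
    return sorting_key[:len(primary_key)] == primary_key

sorting_key = ("CounterID", "EventDate", "intHash32(UserID)")
print(is_valid_primary_key(("CounterID", "EventDate"), sorting_key))  # True
print(is_valid_primary_key(("EventDate",), sorting_key))              # False: not a prefix
```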

This feature is helpful when using the SummingMergeTree and AggregatingMergeTree table engines. In a common case when using these engines, the table has two types of columns: dimensions and measures. Typical queries aggregate values of measure columns with arbitrary GROUP BY and filtering by dimensions. Because SummingMergeTree and AggregatingMergeTree aggregate rows with the same value of the sorting key, it is natural to add all dimensions to it. As a result, the key expression consists of a long list of columns and this list must be frequently updated with newly added dimensions.

In this case it makes sense to leave only a few columns in the primary key that will provide efficient range scans and add the remaining dimension columns to the sorting key tuple.

ALTER of the sorting key is a lightweight operation because when a new column is simultaneously added to the table and to the sorting key, existing data parts do not need to be changed. Since the old sorting key is a prefix of the new sorting key and there is no data in the newly added column, the data is sorted by both the old and new sorting keys at the moment of table modification.

Use of indexes and partitions in queries

For SELECT queries, ClickHouse analyzes whether an index can be used. An index can be used if the WHERE/PREWHERE clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has IN or LIKE with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions.

Thus, it is possible to quickly run queries on one or many ranges of the primary key. In this example, queries will be fast when run for a specific tracking tag, for a specific tag and date range, for a specific tag and date, for multiple tags with a date range, and so on.

Let's look at the engine configured as follows:

ENGINE MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate)
SETTINGS index_granularity=8192

In this case, in queries:

SELECT count() FROM table
WHERE EventDate = toDate(now())
AND CounterID = 34

SELECT count() FROM table
WHERE EventDate = toDate(now())
AND (CounterID = 34 OR CounterID = 42)

SELECT count() FROM table
WHERE ((EventDate >= toDate('2014-01-01')
AND EventDate <= toDate('2014-01-31')) OR EventDate = toDate('2014-05-01'))
AND CounterID IN (101500, 731962, 160656)
AND (CounterID = 101500 OR EventDate != toDate('2014-05-01'))

ClickHouse will use the primary key index to skip data that cannot match and the monthly partitioning key to skip partitions that lie outside the requested date ranges.

The queries above show that the index is used even for complex expressions. Reading from the table is organized so that using the index can't be slower than a full scan.

In the example below, the index can't be used.

SELECT count() FROM table WHERE CounterID = 34 OR URL LIKE '%upyachka%'

To check whether ClickHouse can use the index when running a query, use the settings force_index_by_date and force_primary_key.

The key for partitioning by month allows reading only those data blocks which contain dates from the proper range. In this case, the data block may contain data for many dates (up to an entire month). Within a block, data is sorted by primary key, which might not contain the date as the first column. Because of this, using a query with only a date condition that does not specify the primary key prefix will cause more data to be read than for a single date.

Use of index for deterministic expressions in primary keys

The primary key can contain expressions, not only column names. These expressions are not limited to simple function chains: they can be arbitrary expression trees (for example, nested functions and composite expressions), as long as they are deterministic.

An expression is deterministic if it always returns the same result for the same input values (for example: length(), toDate(), lower(), left(), cityHash64(), toUUID(); unlike now() or rand()). If the primary key contains deterministic expressions, ClickHouse can apply them to constant values from the query and use the result to build conditions on the primary key index. This enables data skipping for predicates like =, IN, and has.

A common use case is to keep the primary key compact (e.g. store a hash instead of a long String), while still allowing predicates on the original column to use the index.

Example of a deterministic (but non-injective) primary key:

ENGINE = MergeTree()
ORDER BY length(user_id)

Example predicates that can use the index:

SELECT * FROM table WHERE user_id = 'alice';
SELECT * FROM table WHERE user_id IN ('alice', 'bob');
SELECT * FROM table WHERE has(['alice', 'bob'], user_id);

In these cases, ClickHouse computes length('alice') (and other constants) once and uses the length values to narrow the ranges in the primary key index. Since length of a string is not injective, different user_id strings can share the same length, so the index may read extra granules (false positives). The result remains correct because the original predicate (user_id = ..., IN, etc.) is still applied after reading.
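The Python sketch below imitates this behaviour on a toy data set: rows are sorted by the length of user_id, granules are selected by comparing the length of the constant against each granule's key range, and the original equality predicate is applied afterwards. It is a simplified model (tiny granules, key range taken from the first and last row of each granule), and all names and sample values are invented for the illustration.

```python
# Rows sorted by the key expression length(user_id).
rows = sorted(["al", "bob", "eve", "alice", "carol", "mallory", "trent99"], key=len)
GRANULE_ROWS = 2  # tiny granules so the effect is visible
granules = [rows[i:i + GRANULE_ROWS] for i in range(0, len(rows), GRANULE_ROWS)]

def select_equals(user_id: str):
    """Evaluate user_id = <constant>; return (matching rows, granules read)."""
    target = len(user_id)  # the key expression applied to the constant
    granules_read, result = 0, []
    for g in granules:
        if len(g[0]) <= target <= len(g[-1]):  # key range of this granule
            granules_read += 1
            # The original predicate still filters every row that was read.
            result += [r for r in g if r == user_id]
    return result, granules_read

print(select_equals("alice"))  # (['alice'], 2)
print(select_equals("zed"))    # ([], 2): granules were read, but no row matches
```

The second call shows the false positives mentioned above: strings of length 3 exist, so two granules are read even though 'zed' is absent, and the result is still correct.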

If the deterministic expression is also injective (different inputs cannot produce the same output for the argument types used), additionally ClickHouse can effectively use the index for the negated forms: !=, NOT IN, and NOT has(...). For example, reverse(p) and hex(p) are injective for String.

Example of an injective primary key:

ENGINE = MergeTree()
ORDER BY hex(p)

More complex injective expressions are also supported, for example:

ENGINE = MergeTree()
ORDER BY reverse(tuple(reverse(p), hex(p)))

Example predicates that can use the index:

SELECT * FROM table WHERE p != 'abc';
SELECT * FROM table WHERE p NOT IN ('abc', '12345');
SELECT * FROM table WHERE NOT has(['abc', '12345'], p);

Use of index for partially-monotonic primary keys

Consider, for example, the days of the month. They form a monotonic sequence for one month, but not monotonic for more extended periods. This is a partially-monotonic sequence. If a user creates the table with partially-monotonic primary key, ClickHouse creates a sparse index as usual. When a user selects data from this kind of table, ClickHouse analyzes the query conditions. If the user wants to get data between two marks of the index and both these marks fall within one month, ClickHouse can use the index in this particular case because it can calculate the distance between the parameters of a query and index marks.

ClickHouse cannot use an index if the values of the primary key in the query parameter range do not represent a monotonic sequence. In this case, ClickHouse uses the full scan method.

ClickHouse uses this logic not only for days of the month sequences, but for any primary key that represents a partially-monotonic sequence.
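For the day-of-month example, the decision can be sketched in Python as follows. The model assumes a hypothetical primary key equal to the day of the month of a date column and a query condition lo <= d <= hi; it only captures the idea that the index is usable when the whole queried range lies within one month. The function name is made up, and the real analysis in ClickHouse is more general.

```python
from datetime import date

def day_of_month_key_range(lo: date, hi: date):
    """Translate lo <= d <= hi into a range over day-of-month, if possible.

    Returns (first_day, last_day) when the function is monotonic on the
    queried range, or None when the index cannot be used for this range.
    """
    if (lo.year, lo.month) == (hi.year, hi.month):
        return (lo.day, hi.day)
    return None

print(day_of_month_key_range(date(2024, 3, 5), date(2024, 3, 20)))  # (5, 20)
print(day_of_month_key_range(date(2024, 3, 25), date(2024, 4, 3)))  # None
```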

Data skipping indexes

The index declaration is in the columns section of the CREATE query.

INDEX index_name expr TYPE type(...) [GRANULARITY granularity_value]

For tables from the *MergeTree family, data skipping indices can be specified.

These indices aggregate some information about the specified expression on blocks, which consist of granularity_value granules (the size of the granule is specified using the index_granularity setting in the table engine). These aggregates are then used in SELECT queries to reduce the amount of data read from disk by skipping big blocks of data where the WHERE condition cannot be satisfied.

The GRANULARITY clause can be omitted; the default value of granularity_value is 1.

Example

CREATE TABLE table_name
(
    u64 UInt64,
    i32 Int32,
    s String,
    ...
    INDEX idx1 u64 TYPE bloom_filter GRANULARITY 3,
    INDEX idx2 u64 * i32 TYPE minmax GRANULARITY 3,
    INDEX idx3 u64 * length(s) TYPE set(1000) GRANULARITY 4
) ENGINE = MergeTree()
...

Indices from the example can be used by ClickHouse to reduce the amount of data to read from disk in the following queries:

SELECT count() FROM table WHERE u64 == 10;
SELECT count() FROM table WHERE u64 * i32 >= 1234;
SELECT count() FROM table WHERE u64 * length(s) == 1234;

Data skipping indexes can also be created on composite columns:

-- on columns of type Map:
INDEX map_key_index mapKeys(map_column) TYPE bloom_filter
INDEX map_value_index mapValues(map_column) TYPE bloom_filter

-- on columns of type JSON:
INDEX json_paths_index JSONAllPaths(json_column) TYPE bloom_filter

-- on columns of type Tuple:
INDEX tuple_1_index tuple_column.1 TYPE bloom_filter
INDEX tuple_2_index tuple_column.2 TYPE bloom_filter

-- on columns of type Nested:
INDEX nested_1_index col.nested_col1 TYPE bloom_filter
INDEX nested_2_index col.nested_col2 TYPE bloom_filter

Skip Index Types

The MergeTree table engine supports the following types of skip indexes. For more information on how skip indexes can be used for performance optimization see "Understanding ClickHouse data skipping indexes".

MinMax index
Set index
bloom_filter index
ngrambf_v1 index (Deprecated)
tokenbf_v1 index (Deprecated)
text index
vector_similarity index

MinMax skip index

For each index granule, the minimum and maximum values of an expression are stored. (If the expression is of type tuple, it stores the minimum and maximum for each tuple element.)

minmax
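A minimal Python model of a minmax skip index may help to see how GRANULARITY and index_granularity combine. It is a sketch under simplified assumptions (a single in-memory column, a >= predicate only), not the actual implementation; the function names and sample values are invented.

```python
def build_minmax_index(values, index_granularity, granularity):
    """Store (min, max) for each block of `granularity` granules."""
    block_rows = index_granularity * granularity
    return [(min(values[i:i + block_rows]), max(values[i:i + block_rows]))
            for i in range(0, len(values), block_rows)]

def blocks_to_read(index, lower_bound):
    """Blocks that may contain a value >= lower_bound; the rest are skipped."""
    return [n for n, (_, block_max) in enumerate(index) if block_max >= lower_bound]

values = [3, 8, 1, 9, 40, 42, 41, 47, 12, 15, 11, 10]
index = build_minmax_index(values, index_granularity=2, granularity=2)
print(index)                      # [(1, 9), (40, 47), (10, 15)]
print(blocks_to_read(index, 20))  # [1]: only the middle block can match
```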

Set

For each index granule, at most max_rows unique values of the specified expression are stored. max_rows = 0 means "store all unique values".

set(max_rows)

Bloom filter

Stores a bloom filter for the specified columns for each index granule.

bloom_filter([false_positive_rate])

The false_positive_rate parameter can take a value between 0 and 1 (default: 0.025) and specifies the probability of a false positive (a false positive increases the amount of data to be read).

The following data types are supported:

(U)Int*
Float*
Enum
Date
DateTime
String
FixedString
Array
LowCardinality
Nullable
UUID
Map
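For readers unfamiliar with the data structure, here is a minimal Bloom filter in Python. It is a generic textbook sketch, not the filter used by ClickHouse: the SHA-256 based hashing, the class name, and the sizes are assumptions chosen for a short, deterministic example.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int, num_hashes: int):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

bf = BloomFilter(size_bits=1024, num_hashes=3)
for word in ["alice", "bob", "carol"]:
    bf.add(word)
print(all(bf.might_contain(w) for w in ["alice", "bob", "carol"]))  # True: no false negatives
```

A granule can be skipped only when might_contain returns False for the searched value; a True answer may be a false positive and merely causes extra data to be read.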

Map data type: specifying index creation with keys or values

For the Map data type, the client can specify if the index should be created for keys or for values using the mapKeys or mapValues functions.

JSON data type: indexing JSON paths

For the JSON data type, a bloom filter index can be created on the set of paths using the JSONAllPaths function. This allows skipping granules where a queried JSON path is absent. See Data skipping indexes for JSON for details.

N-gram bloom filter (Deprecated)

Note

With general availability (GA) of the text index starting from ClickHouse version 26.2, the ngrambf_v1 index is no longer recommended for full text search.

See page "Full-text search with text indexes" for details.

Stores a bloom filter for the n-grams of the specified columns for each index granule.

ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

Parameter	Description
`n`	ngram size
`size_of_bloom_filter_in_bytes`	Bloom filter size in bytes. You can use a large value here, for example, `256` or `512`, because it can be compressed well.
`number_of_hash_functions`	The number of hash functions used in the bloom filter.
`random_seed`	Seed for the bloom filter hash functions.

This index only works with the following data types:

To estimate the parameters of ngrambf_v1, you can use the following User Defined Functions (UDFs).

CREATE FUNCTION bfEstimateFunctions [ON CLUSTER cluster]
AS
(total_number_of_all_grams, size_of_bloom_filter_in_bits) -> round((size_of_bloom_filter_in_bits / total_number_of_all_grams) * log(2));

CREATE FUNCTION bfEstimateBmSize [ON CLUSTER cluster]
AS
(total_number_of_all_grams,  probability_of_false_positives) -> ceil((total_number_of_all_grams * log(probability_of_false_positives)) / log(1 / pow(2, log(2))));

CREATE FUNCTION bfEstimateFalsePositive [ON CLUSTER cluster]
AS
(total_number_of_all_grams, number_of_hash_functions, size_of_bloom_filter_in_bytes) -> pow(1 - exp(-number_of_hash_functions/ (size_of_bloom_filter_in_bytes / total_number_of_all_grams)), number_of_hash_functions);

CREATE FUNCTION bfEstimateGramNumber [ON CLUSTER cluster]
AS
(number_of_hash_functions, probability_of_false_positives, size_of_bloom_filter_in_bytes) -> ceil(size_of_bloom_filter_in_bytes / (-number_of_hash_functions / log(1 - exp(log(probability_of_false_positives) / number_of_hash_functions))));

To use these functions, you need to specify at least two parameters:

total_number_of_all_grams
probability_of_false_positives

For example, there are 4300 ngrams in the granule and you expect false positives to be less than 0.0001. The other parameters can then be estimated by executing the following queries:

--- estimate the size of the filter (bits converted to bytes)
SELECT bfEstimateBmSize(4300, 0.0001) / 8 AS size_of_bloom_filter_in_bytes;

┌─size_of_bloom_filter_in_bytes─┐
│                         10304 │
└───────────────────────────────┘

--- estimate number of hash functions
SELECT bfEstimateFunctions(4300, bfEstimateBmSize(4300, 0.0001)) AS number_of_hash_functions;

┌─number_of_hash_functions─┐
│                       13 │
└──────────────────────────┘

Of course, you can also use those functions to estimate parameters for other conditions. The functions above refer to the bloom filter calculator here.
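The two estimates used in the example above can be cross-checked outside the database. The Python functions below are straightforward ports of the bfEstimateBmSize and bfEstimateFunctions UDFs shown earlier (natural logarithms throughout); the Python names are chosen for this sketch.

```python
import math

def bf_estimate_bm_size(total_grams: int, false_positive_probability: float) -> int:
    """Filter size in bits (port of bfEstimateBmSize)."""
    return math.ceil(total_grams * math.log(false_positive_probability)
                     / math.log(1 / 2 ** math.log(2)))

def bf_estimate_functions(total_grams: int, size_in_bits: int) -> int:
    """Number of hash functions (port of bfEstimateFunctions)."""
    return round(size_in_bits / total_grams * math.log(2))

bits = bf_estimate_bm_size(4300, 0.0001)
print(bits / 8)                           # 10304.0 bytes
print(bf_estimate_functions(4300, bits))  # 13
```

Both values agree with the query results shown above.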

Token bloom filter (Deprecated)

Note

With general availability (GA) of the text index starting from ClickHouse version 26.2, the tokenbf_v1 index is no longer recommended for full text search.

See page "Full-text search with text indexes" for details.

tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

Sparse grams bloom filter

The sparse grams bloom filter is similar to ngrambf_v1 but uses sparse grams tokens instead of ngrams.

sparse_grams(min_ngram_length, max_ngram_length, min_cutoff_length, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed)

Text index

Builds an inverted index over tokenized string data, enabling efficient and deterministic full-text search. See here for details.

Vector similarity

Supports approximate nearest neighbor search, see here for details.

Functions support

Conditions in the WHERE clause contain calls of functions that operate on columns. If a column is part of an index, ClickHouse tries to use this index when evaluating the functions. ClickHouse supports different subsets of functions for different index types.

Indexes of type set can be utilized by all functions. The other index types are supported as follows:

Function (operator) / Index	primary key	minmax	ngrambf_v1	tokenbf_v1	bloom_filter	sparse_grams	text
equals (=, ==)	✔	✔	✔	✔	✔	✔	✔
notEquals(!=, <>)	✔	✔	✔	✔	✔	✔	✗
like	✔	✔	✔	✔	✗	✔	✔
notLike	✔	✔	✔	✔	✗	✔	✗
match	✗	✗	✔	✔	✗	✔	✔
startsWith	✔	✔	✔	✔	✗	✔	✔
endsWith	✗	✗	✔	✔	✗	✔	✔
multiSearchAny	✗	✗	✔	✗	✗	✗	✔
multiSearchAnyUTF8	✗	✗	✗	✗	✗	✗	✔
multiMatchAny	✗	✗	✗	✗	✗	✗	✔
in	✔	✔	✔	✔	✔	✔	✔
notIn	✔	✔	✔	✔	✔	✔	✗
less (`<`)	✔	✔	✗	✗	✗	✗	✗
greater (`>`)	✔	✔	✗	✗	✗	✗	✗
lessOrEquals (`<=`)	✔	✔	✗	✗	✗	✗	✗
greaterOrEquals (`>=`)	✔	✔	✗	✗	✗	✗	✗
empty	✔	✔	✗	✗	✗	✗	✗
notEmpty	✗	✔	✗	✗	✗	✔	✗
has	✔	✔	✔	✔	✔	✔	✔
hasAny	✗	✗	✔	✔	✔	✔	✗
hasAll	✗	✗	✔	✔	✔	✔	✗
hasToken	✗	✗	✗	✔	✗	✗	✔
hasTokenOrNull	✗	✗	✗	✔	✗	✗	✔
hasTokenCaseInsensitive (`*`)	✗	✗	✗	✔	✗	✗	✗
hasTokenCaseInsensitiveOrNull (`*`)	✗	✗	✗	✔	✗	✗	✗
hasAnyTokens	✗	✗	✗	✗	✗	✗	✔
hasAllTokens	✗	✗	✗	✗	✗	✗	✔
pointInPolygon	✔	✔	✗	✗	✗	✗	✗
mapContains (mapContainsKey)	✗	✗	✗	✗	✗	✗	✔
mapContainsKeyLike	✗	✗	✗	✗	✗	✗	✔
mapContainsValue	✗	✗	✗	✗	✗	✗	✔
mapContainsValueLike	✗	✗	✗	✗	✗	✗	✔

Functions with a constant argument that is less than ngram size can't be used by ngrambf_v1 for query optimization.

(*) For hasTokenCaseInsensitive and hasTokenCaseInsensitiveOrNull to be effective, the tokenbf_v1 index must be created on lowercased data, for example INDEX idx (lower(str_col)) TYPE tokenbf_v1(512, 3, 0).

Note

Bloom filters can have false positive matches, so the ngrambf_v1, tokenbf_v1, sparse_grams, and bloom_filter indexes cannot be used for optimizing queries where the result of a function is expected to be false.

For example:

Can be optimized:
- s LIKE '%test%'
- NOT s NOT LIKE '%test%'
- s = 1
- NOT s != 1
- startsWith(s, 'test')
Cannot be optimized:
- NOT s LIKE '%test%'
- s NOT LIKE '%test%'
- NOT s = 1
- s != 1
- NOT startsWith(s, 'test')

Projections

Projections are like materialized views but are defined at the part level. They provide consistency guarantees along with automatic usage in queries.

Note

When you are implementing projections you should also consider the force_optimize_projection setting.

Projections are not supported in the SELECT statements with the FINAL modifier.

Projection query

A projection query is what defines a projection. It implicitly selects data from the parent table.

Syntax

SELECT <column list expr> [GROUP BY] <group keys expr> [ORDER BY] <expr>

Projections can be modified or dropped with the ALTER statement.

Projection indexes

Projection indexes extend the projection subsystem by providing a lightweight and explicit way to define projection-level indexes. Externally, a projection index is still a projection, but with simplified syntax and clearer intent: it defines an expression which is dedicated to filtering, rather than serving materialized data. Internally, a projection index does not materialize the original table in permuted row order like a regular projection. Instead, the permutation is stored in the form of a numeric permutation column _part_offset, i.e. SELECT _part_offset ORDER BY <index_expr>.

Syntax

PROJECTION <name> INDEX <index_expr> TYPE <index_type>

Example:

CREATE TABLE example
(
    id UInt64,
    region String,
    user_id UInt32,
    PROJECTION region_proj INDEX region TYPE basic,
    PROJECTION uid_proj INDEX user_id TYPE basic
)
ENGINE = MergeTree
ORDER BY id;

Index types

Currently supported:

basic: equivalent to a normal MergeTree index on the expression.

The framework allows adding more index types in the future.

Projection storage

Projections are stored inside the part directory. It's similar to an index but contains a subdirectory that stores an anonymous MergeTree table's part. The table is induced by the definition query of the projection. If there is a GROUP BY clause, the underlying storage engine becomes AggregatingMergeTree, and all aggregate functions are converted to AggregateFunction. If there is an ORDER BY clause, the MergeTree table uses it as its primary key expression. During the merge process the projection part is merged via its storage's merge routine. The checksum of the parent table's part is combined with the projection's part. Other maintenance jobs are similar to skip indices.

Query analysis

Check if the projection can be used to answer the given query, that is, it generates the same answer as querying the base table.
Select the best feasible match, which contains the least granules to read.
The query pipeline which uses projections will be different from the one that uses the original parts. If the projection is absent in some parts, we can add the pipeline to "project" it on the fly.

Concurrent data access

For concurrent table access, we use multi-versioning. In other words, when a table is simultaneously read and updated, data is read from a set of parts that is current at the time of the query. There are no lengthy locks. Inserts do not get in the way of read operations.
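The idea of reading from a fixed set of parts can be illustrated with a toy Python model. It is single-threaded and leaves out merges and locking entirely; it only shows that a reader holding the earlier set of parts is unaffected by a later insert. The class and method names are invented for this sketch.

```python
class Table:
    def __init__(self):
        self._parts = ()  # immutable tuple: the current set of parts

    def insert(self, rows):
        # Publish a new set of parts instead of modifying the old one.
        self._parts = self._parts + (tuple(rows),)

    def snapshot(self):
        # A query keeps the set of parts that was current when it started.
        return self._parts

table = Table()
table.insert([1, 2, 3])
query_view = table.snapshot()  # a SELECT starts here
table.insert([4, 5])           # an INSERT adds a part in the meantime

print(sum(len(p) for p in query_view))        # 3: the running query is unaffected
print(sum(len(p) for p in table.snapshot()))  # 5: a new query sees both parts
```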

Reading from a table is automatically parallelized.

TTL for columns and tables

Determines the lifetime of values.

The TTL clause can be set for the whole table and for each individual column. Table-level TTL can also specify the logic for automatically moving data between disks and volumes, or for recompressing parts in which all the data has expired.

Expressions must evaluate to Date, Date32, DateTime or DateTime64 data type.

Avoid non-deterministic functions in TTL expressions

TTL is evaluated during background merges, and not at insert time. Functions like rand(), now(), or now64() will be re-evaluated on every merge, leading to unpredictable deletion behavior. ClickHouse blocks expressions with no column dependency at all, but does not currently reject non-deterministic functions mixed with a column reference (e.g. ts + rand()). TTL expressions should be based solely on deterministic, column-derived values for predictable results.
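
As a minimal sketch of the recommendation above, the following table derives its TTL only from a stored column, so every merge computes the same expiry time for a given row. The table and column names (events_with_ttl, ts, payload) are hypothetical.

```sql
-- Hypothetical table: each row expires 30 days after its own timestamp.
CREATE TABLE events_with_ttl
(
    ts DateTime,
    payload String
)
ENGINE = MergeTree
ORDER BY ts
TTL ts + INTERVAL 30 DAY;

-- By contrast, an expression such as `ts + rand()` mixes a column with a
-- non-deterministic function and would be re-evaluated differently on each merge.
```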

Syntax

Setting time-to-live for a column:

TTL time_column
TTL time_column + interval

To define interval, use time interval operators, for example:

TTL date_time + INTERVAL 1 MONTH
TTL date_time + INTERVAL 15 HOUR

Column TTL

When the values in a column expire, ClickHouse replaces them with the default value for the column's data type. If all the column values in a data part expire, ClickHouse deletes this column from the data part on the filesystem.

The TTL clause can't be used for key columns.

Examples

Creating a table with `TTL`:

CREATE TABLE tab
(
    d DateTime,
    a Int TTL d + INTERVAL 1 MONTH,
    b Int TTL d + INTERVAL 1 MONTH,
    c String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d;

Adding TTL to a column of an existing table

ALTER TABLE tab
    MODIFY COLUMN
    c String TTL d + INTERVAL 1 DAY;

Altering TTL of the column

ALTER TABLE tab
    MODIFY COLUMN
    c String TTL d + INTERVAL 1 MONTH;

Table TTL

A table can have one expression for the removal of expired rows and multiple expressions for automatically moving parts between disks or volumes. When rows in the table expire, ClickHouse deletes all corresponding rows. For moving or recompressing parts, all rows of a part must satisfy the TTL expression criteria.

TTL expr
    [DELETE|RECOMPRESS codec_name1|TO DISK 'xxx'|TO VOLUME 'xxx'][, DELETE|RECOMPRESS codec_name2|TO DISK 'aaa'|TO VOLUME 'bbb'] ...
    [WHERE conditions]
    [GROUP BY key_expr [SET v1 = aggr_func(v1) [, v2 = aggr_func(v2) ...]] ]

A TTL rule type may follow each TTL expression. It determines the action taken once the expression is satisfied (its value reaches the current time):

DELETE - delete expired rows (default action);
RECOMPRESS codec_name - recompress data part with the codec_name;
TO DISK 'aaa' - move part to the disk aaa;
TO VOLUME 'bbb' - move part to the volume bbb;
GROUP BY - aggregate expired rows.

DELETE action can be used together with WHERE clause to delete only some of the expired rows based on a filtering condition:

TTL time_column + INTERVAL 1 MONTH DELETE WHERE column = 'value'

GROUP BY expression must be a prefix of the table primary key.

If a column is not part of the GROUP BY expression and is not set explicitly in the SET clause, the result row contains an arbitrary value of that column from the grouped rows (as if the aggregate function any were applied to it).

Examples

Creating a table with `TTL`:

CREATE TABLE tab
(
    d DateTime,
    a Int
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 1 MONTH DELETE,
    d + INTERVAL 1 WEEK TO VOLUME 'aaa',
    d + INTERVAL 2 WEEK TO DISK 'bbb';

Altering `TTL` of the table:

ALTER TABLE tab
    MODIFY TTL d + INTERVAL 1 DAY;

Creating a table, where the rows are expired after one month. The expired rows where dates are Mondays are deleted:

CREATE TABLE table_with_where
(
    d DateTime,
    a Int
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 1 MONTH DELETE WHERE toDayOfWeek(d) = 1;

Creating a table, where expired rows are recompressed:

CREATE TABLE table_for_recompression
(
    d DateTime,
    key UInt64,
    value String
) ENGINE MergeTree()
ORDER BY tuple()
PARTITION BY key
TTL d + INTERVAL 1 MONTH RECOMPRESS CODEC(ZSTD(17)), d + INTERVAL 1 YEAR RECOMPRESS CODEC(LZ4HC(10))
SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0;

Creating a table, where expired rows are aggregated. In the result rows, x contains the maximum value across the grouped rows, y contains the minimum value, and d contains an arbitrary value from the grouped rows.

CREATE TABLE table_for_aggregation
(
    d DateTime,
    k1 Int,
    k2 Int,
    x Int,
    y Int
)
ENGINE = MergeTree
ORDER BY (k1, k2)
TTL d + INTERVAL 1 MONTH GROUP BY k1, k2 SET x = max(x), y = min(y);

Removing expired data

Data with an expired TTL is removed when ClickHouse merges data parts.

When ClickHouse detects that data is expired, it performs an off-schedule merge. To control the frequency of such merges, you can set merge_with_ttl_timeout. If the value is too low, it will perform many off-schedule merges that may consume a lot of resources.

If you perform the SELECT query between merges, you may get expired data. To avoid it, use the OPTIMIZE query before SELECT.
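
A minimal sketch of that pattern, using the tab table from the examples above. OPTIMIZE ... FINAL forces a merge and can be expensive on large tables, so this is an illustration rather than something to run before every query.

```sql
-- Force a merge so that TTL rules are applied before reading.
OPTIMIZE TABLE tab FINAL;

-- After the merge, expired rows should no longer be returned.
SELECT count() FROM tab;
```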

See Also

ttl_only_drop_parts setting

Disk types

In addition to local block devices, ClickHouse supports external storage types such as S3, Azure Blob Storage, and HDFS. See Using external storage for data storage.

Using multiple block devices for data storage

Introduction

MergeTree family table engines can store data on multiple block devices. This is useful, for example, when the data of a table is implicitly split into "hot" and "cold". The most recent data is requested regularly but requires only a small amount of space. By contrast, the fat-tailed historical data is requested rarely. If several disks are available, the "hot" data can be placed on fast disks (for example, NVMe SSDs or in memory), while the "cold" data can be placed on relatively slow ones (for example, HDD).

This applies to all disk types, including S3 and other object storage disks. For example, you can spread data across multiple S3 buckets within a single volume, or create tiered policies that move data from local disks to S3. See Using S3 disks with multiple volumes for details.

A data part is the minimum movable unit for MergeTree-engine tables. The data belonging to one part is stored on one disk. Data parts can be moved between disks in the background (according to user settings) as well as by means of ALTER queries.

Terms

Disk — Block device mounted to the filesystem.
Default disk — Disk that stores the path specified in the path server setting.
Volume — Ordered set of equal disks (similar to JBOD).
Storage policy — Set of volumes and the rules for moving data between them.

The names given to the described entities can be found in the system tables, system.storage_policies and system.disks. To apply one of the configured storage policies for a table, use the storage_policy setting of MergeTree-engine family tables.
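
For instance, the two system tables can be inspected with queries like the following. The column lists are a selection based on the system table reference and may differ slightly between server versions.

```sql
-- Configured disks with their paths and space accounting.
SELECT name, path, free_space, total_space, keep_free_space
FROM system.disks;

-- One row per (policy, volume) pair, with the disks that make up each volume.
SELECT policy_name, volume_name, volume_priority, disks, max_data_part_size, move_factor
FROM system.storage_policies;
```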

Configuration

Disks, volumes and storage policies should be declared inside the <storage_configuration> tag in a file in the config.d directory.

Tip

Disks can also be declared in the SETTINGS section of a query. This is useful for ad-hoc analysis to temporarily attach a disk that is, for example, hosted at a URL. See dynamic storage for more details.

Configuration structure:

<storage_configuration>
    <disks>
        <disk_name_1> <!-- disk name -->
            <path>/mnt/fast_ssd/clickhouse/</path>
        </disk_name_1>
        <disk_name_2>
            <path>/mnt/hdd1/clickhouse/</path>
            <keep_free_space_bytes>10485760</keep_free_space_bytes>
        </disk_name_2>
        <disk_name_3>
            <path>/mnt/hdd2/clickhouse/</path>
            <keep_free_space_bytes>10485760</keep_free_space_bytes>
        </disk_name_3>

        ...
    </disks>

    ...
</storage_configuration>

Tags:

<disk_name_N> — Disk name. Names must be different for all disks.
path — path under which a server will store data (data and shadow folders), should be terminated with '/'.
keep_free_space_bytes — the amount of free disk space to be reserved.

The order of the disk definition is not important.

Storage policies configuration markup:

<storage_configuration>
    ...
    <policies>
        <policy_name_1>
            <volumes>
                <volume_name_1>
                    <disk>disk_name_from_disks_configuration</disk>
                    <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
                    <load_balancing>round_robin</load_balancing>
                </volume_name_1>
                <volume_name_2>
                    <!-- configuration -->
                </volume_name_2>
                <!-- more volumes -->
            </volumes>
            <move_factor>0.2</move_factor>
        </policy_name_1>
        <policy_name_2>
            <!-- configuration -->
        </policy_name_2>

        <!-- more policies -->
    </policies>
    ...
</storage_configuration>

Tags:

policy_name_N — Policy name. Policy names must be unique.
volume_name_N — Volume name. Volume names must be unique.
disk — a disk within a volume.
max_data_part_size_bytes — the maximum size of a part that can be stored on any of the volume's disks. If the size of a merged part is estimated to be bigger than max_data_part_size_bytes, the part is written to the next volume. This feature keeps new and small parts on a hot (SSD) volume and moves them to a cold (HDD) volume once they grow large. Do not use this setting if your policy has only one volume.
move_factor — when the fraction of available space gets lower than this factor, data automatically starts to move to the next volume, if any (by default, 0.1). ClickHouse sorts existing parts by size from largest to smallest (in descending order) and selects parts whose total size is sufficient to meet the move_factor condition. If the total size of all parts is insufficient, all parts are moved.
perform_ttl_move_on_insert — Disabling this setting turns off TTL moves on data part INSERT. By default (enabled), a data part that has already expired under a TTL move rule at insert time goes immediately to the volume/disk declared in the move rule. This can significantly slow down inserts if the destination volume/disk is slow (e.g. S3). If disabled, the already expired data part is written to the default volume and then moved to the TTL volume right after.
load_balancing - Policy for disk balancing, round_robin or least_used.
least_used_ttl_ms - Configure the timeout (in milliseconds) for updating the available space on all disks (0 - update always, -1 - never update, default is 60000). Note: if the disk is used by ClickHouse only and is not subject to an online filesystem resize/shrink, you can use -1. In all other cases this is not recommended, since it will eventually lead to incorrect space distribution.
prefer_not_to_merge — You should not use this setting. Disables merging of data parts on this volume (this is harmful and leads to performance degradation). When this setting is enabled (don't do it), merging data on this volume is not allowed (which is bad). This allows (but you don't need it) controlling (if you want to control something, you're making a mistake) how ClickHouse works with slow disks (but ClickHouse knows better, so please don't use this setting).
volume_priority — Defines the priority (order) in which volumes are filled. Lower value means higher priority. The parameter values should be natural numbers and collectively cover the range from 1 to N (lowest priority given) without skipping any numbers.
- If all volumes are tagged, they are prioritized in given order.
- If only some volumes are tagged, those without the tag have the lowest priority, and they are prioritized in the order they are defined in config.
- If no volumes are tagged, their priority follows the order in which they are declared in the configuration.
- Two volumes cannot have the same priority value.

Configuration examples:

<storage_configuration>
    ...
    <policies>
        <hdd_in_order> <!-- policy name -->
            <volumes>
                <single> <!-- volume name -->
                    <disk>disk1</disk>
                    <disk>disk2</disk>
                </single>
            </volumes>
        </hdd_in_order>

        <moving_from_ssd_to_hdd>
            <volumes>
                <hot>
                    <disk>fast_ssd</disk>
                    <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
                </hot>
                <cold>
                    <disk>disk1</disk>
                </cold>
            </volumes>
            <move_factor>0.2</move_factor>
        </moving_from_ssd_to_hdd>

        <small_jbod_with_external_no_merges>
            <volumes>
                <main>
                    <disk>jbod1</disk>
                </main>
                <external>
                    <disk>external</disk>
                </external>
            </volumes>
        </small_jbod_with_external_no_merges>
    </policies>
    ...
</storage_configuration>

In the given example, the hdd_in_order policy implements the round-robin approach. The policy defines only one volume (single), and data parts are stored on all its disks in circular order. Such a policy can be quite useful if several similar disks are mounted to the system but RAID is not configured. Keep in mind that each individual disk drive is not reliable, and you might want to compensate for that with a replication factor of 3 or more.

If there are different kinds of disks available in the system, the moving_from_ssd_to_hdd policy can be used instead. The volume hot consists of an SSD disk (fast_ssd), and the maximum size of a part that can be stored on this volume is 1 GiB (1073741824 bytes). All parts larger than 1 GiB are stored directly on the cold volume, which contains the HDD disk disk1. Also, once the disk fast_ssd is more than 80% full (move_factor is 0.2), data is transferred to disk1 by a background process.

The order of volume enumeration within a storage policy is important in case at least one of the volumes listed has no explicit volume_priority parameter. Once a volume is overfilled, data are moved to the next one. The order of disk enumeration is important as well because data are stored on them in turns.

When creating a table, one can apply one of the configured storage policies to it:

CREATE TABLE table_with_non_default_policy (
    EventDate Date,
    OrderID UInt64,
    BannerID UInt64,
    SearchPhrase String
) ENGINE = MergeTree
ORDER BY (OrderID, BannerID)
PARTITION BY toYYYYMM(EventDate)
SETTINGS storage_policy = 'moving_from_ssd_to_hdd'

The default storage policy implies using only one volume, which consists of only one disk given in <path>. You can change the storage policy after table creation with an ALTER TABLE ... MODIFY SETTING query; the new policy must include all old disks and volumes with the same names.
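
A minimal sketch of such a change, assuming a hypothetical policy named moving_from_ssd_to_hdd_extended has been added to the server configuration and still contains the hot and cold volumes with their original disks:

```sql
-- Switch the table to the (hypothetical) extended policy.
ALTER TABLE table_with_non_default_policy
    MODIFY SETTING storage_policy = 'moving_from_ssd_to_hdd_extended';

-- Check which policy the table uses now.
SELECT name, storage_policy
FROM system.tables
WHERE database = currentDatabase() AND name = 'table_with_non_default_policy';
```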

The number of threads performing background moves of data parts can be changed by background_move_pool_size setting.

Details

For MergeTree tables, data reaches disk in several ways:

As a result of an insert (INSERT query).
During background merges and mutations.
When downloading from another replica.
As a result of partition freezing ALTER TABLE ... FREEZE PARTITION.

In all these cases except for mutations and partition freezing, a part is stored on a volume and a disk according to the given storage policy:

The first volume (in the order of definition) that has enough disk space for storing the part (unreserved_space > current_part_size) and allows storing parts of the given size (max_data_part_size_bytes > current_part_size) is chosen.
Within this volume, the disk chosen is the one that follows the disk used for storing the previous chunk of data and that has more free space than the part size (unreserved_space - keep_free_space_bytes > current_part_size).

Under the hood, mutations and partition freezing make use of hard links. Hard links between different disks are not supported, therefore in such cases the resulting parts are stored on the same disks as the initial ones.

In the background, parts are moved between volumes based on the amount of free space (move_factor parameter), in the order the volumes are declared in the configuration file. Data is never transferred from the last volume to the first one. You can use the system tables system.part_log (field type = MOVE_PART) and system.parts (fields path and disk) to monitor background moves. Detailed information can also be found in the server logs.
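
The monitoring described above might look like the sketch below. The column names (disk_name, path, event_type, path_on_disk) and the 'MovePart' event value follow the system table reference for recent versions and are assumptions here; they differ from the shorthand field names in the paragraph above. system.part_log is only populated if part logging is enabled on the server.

```sql
-- Where do the active parts of the table currently live?
SELECT name, disk_name, path
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'table_with_non_default_policy'
  AND active;

-- Most recent part moves recorded for the table.
SELECT event_time, part_name, path_on_disk
FROM system.part_log
WHERE table = 'table_with_non_default_policy'
  AND event_type = 'MovePart'
ORDER BY event_time DESC
LIMIT 10;
```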

You can force moving a part or a partition from one volume to another using the query ALTER TABLE ... MOVE PART|PARTITION ... TO VOLUME|DISK .... All the restrictions for background operations are taken into account. The query initiates a move on its own and does not wait for background operations to complete. You will get an error message if not enough free space is available or if any of the required conditions are not met.
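
As an illustration, the statements below move data of the table_with_non_default_policy table defined earlier, which uses the moving_from_ssd_to_hdd policy. The partition value and the part name are hypothetical; real part names can be looked up in system.parts.

```sql
-- Move a whole monthly partition (PARTITION BY toYYYYMM(EventDate)) to the cold volume.
ALTER TABLE table_with_non_default_policy MOVE PARTITION 202401 TO VOLUME 'cold';

-- Move a single part to a specific disk of the policy.
ALTER TABLE table_with_non_default_policy MOVE PART '202401_1_1_0' TO DISK 'disk1';
```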

Moving data does not interfere with data replication. Therefore, different storage policies can be specified for the same table on different replicas.

After the completion of background merges and mutations, old parts are removed only after a certain amount of time (old_parts_lifetime). During this time, they are not moved to other volumes or disks. Therefore, until the parts are finally removed, they are still taken into account for evaluation of the occupied disk space.

User can assign new big parts to different disks of a JBOD volume in a balanced way using the min_bytes_to_rebalance_partition_over_jbod setting.

Using external storage for data storage

MergeTree family table engines can store data in S3, AzureBlobStorage, or HDFS using a disk of type s3, azure_blob_storage, or hdfs, respectively. See configuring external storage options for more details.

Example for S3 as external storage using a disk with type s3.

Configuration markup:

<storage_configuration>
    ...
    <disks>
        <s3>
            <type>s3</type>
            <support_batch_delete>true</support_batch_delete>
            <endpoint>https://clickhouse-public-datasets.s3.amazonaws.com/my-bucket/root-path/</endpoint>
            <access_key_id>your_access_key_id</access_key_id>
            <secret_access_key>your_secret_access_key</secret_access_key>
            <region></region>
            <header>Authorization: Bearer SOME-TOKEN</header>
            <server_side_encryption_customer_key_base64>your_base64_encoded_customer_key</server_side_encryption_customer_key_base64>
            <server_side_encryption_kms_key_id>your_kms_key_id</server_side_encryption_kms_key_id>
            <server_side_encryption_kms_encryption_context>your_kms_encryption_context</server_side_encryption_kms_encryption_context>
            <server_side_encryption_kms_bucket_key_enabled>true</server_side_encryption_kms_bucket_key_enabled>
            <proxy>
                <uri>http://proxy1</uri>
                <uri>http://proxy2</uri>
            </proxy>
            <connect_timeout_ms>10000</connect_timeout_ms>
            <request_timeout_ms>5000</request_timeout_ms>
            <retry_attempts>10</retry_attempts>
            <single_read_retries>4</single_read_retries>
            <min_bytes_for_seek>1000</min_bytes_for_seek>
            <metadata_path>/var/lib/clickhouse/disks/s3/</metadata_path>
            <skip_access_check>false</skip_access_check>
        </s3>
        <s3_cache>
            <type>cache</type>
            <disk>s3</disk>
            <path>/var/lib/clickhouse/disks/s3_cache/</path>
            <max_size>10Gi</max_size>
        </s3_cache>
    </disks>
    ...
</storage_configuration>

Also see configuring external storage options.

Using S3 disks with multiple volumes

S3 (and other object storage) disks can be used in multi-disk and multi-volume storage policies the same way as local disks. This allows you to spread data across multiple S3 buckets within a single volume (JBOD-style), or set up tiered storage policies with S3 volumes.

For example, to distribute data across two S3 buckets in a round-robin fashion:

<storage_configuration>
    <disks>
        <s3_bucket1>
            <type>s3</type>
            <endpoint>https://s3.amazonaws.com/bucket-1/data/</endpoint>
            <access_key_id>your_access_key_id</access_key_id>
            <secret_access_key>your_secret_access_key</secret_access_key>
        </s3_bucket1>
        <s3_bucket2>
            <type>s3</type>
            <endpoint>https://s3.amazonaws.com/bucket-2/data/</endpoint>
            <access_key_id>your_access_key_id</access_key_id>
            <secret_access_key>your_secret_access_key</secret_access_key>
        </s3_bucket2>
    </disks>
    <policies>
        <s3_multi_bucket>
            <volumes>
                <main>
                    <disk>s3_bucket1</disk>
                    <disk>s3_bucket2</disk>
                </main>
            </volumes>
        </s3_multi_bucket>
    </policies>
</storage_configuration>

You can also combine local and S3 volumes in a tiered policy, for example moving data from a local SSD to S3 as it ages:

<storage_configuration>
    <disks>
        <local_ssd>
            <path>/mnt/fast_ssd/clickhouse/</path>
        </local_ssd>
        <s3_cold>
            <type>s3</type>
            <endpoint>https://s3.amazonaws.com/cold-storage/data/</endpoint>
            <access_key_id>your_access_key_id</access_key_id>
            <secret_access_key>your_secret_access_key</secret_access_key>
        </s3_cold>
    </disks>
    <policies>
        <local_to_s3>
            <volumes>
                <hot>
                    <disk>local_ssd</disk>
                    <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
                </hot>
                <cold>
                    <disk>s3_cold</disk>
                </cold>
            </volumes>
            <move_factor>0.2</move_factor>
        </local_to_s3>
    </policies>
</storage_configuration>
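
With the local_to_s3 policy above, the policy itself moves parts by size and free space. To move data by age, a table can add a TTL move rule that targets the cold volume. The following is a sketch under that assumption; the table name events_tiered and the three-month threshold are hypothetical.

```sql
-- Hypothetical table that writes to the local SSD first and moves parts
-- to the S3-backed `cold` volume once all their rows are older than three months.
CREATE TABLE events_tiered
(
    d DateTime,
    payload String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 3 MONTH TO VOLUME 'cold'
SETTINGS storage_policy = 'local_to_s3';
```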

Note

When using use_environment_credentials for S3 authentication, the environment credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) are shared across all S3 disks. It is not possible to use different environment credentials for different disks. If you need different credentials for each S3 disk, use explicit access_key_id and secret_access_key settings per disk instead.

It is possible to set up non-replicated MergeTree tables with a one-writer, many-readers scenario on shared storage. This is provided by the automatic refresh of the parts list, which can be set up on readers. Note that this requires shared filesystem metadata across replicas (or table_disk = true with a table-local disk). See refresh_parts_interval and table_disk.

cache configuration

ClickHouse versions 22.3 through 22.7 use a different cache configuration, see using local cache if you are using one of those versions.

Virtual columns

_part — Name of a part.
_part_index — Sequential index of the part in the query result.
_part_starting_offset — Cumulative starting row of the part in the query result.
_part_offset — Row number within the part.
_part_granule_offset — Granule number within the part.
_partition_id — Name of a partition.
_part_uuid — Unique part identifier (if the MergeTree setting assign_part_uuids is enabled).
_part_data_version — Data version of part (either min block number or mutation version).
_partition_value — Values (a tuple) of a partition by expression.
_sample_factor — Sample factor (from the query).
_block_number — Original number of block for row that was assigned at insert, persisted on merges when setting enable_block_number_column is enabled.
_block_offset — Original number of row in block that was assigned at insert, persisted on merges when setting enable_block_offset_column is enabled.
_disk_name — Disk name used for the storage.
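
Virtual columns are not returned by SELECT * and have to be named explicitly. A short sketch, assuming a MergeTree table named tab such as the ones created in the examples above:

```sql
-- Show where each row is stored: part name, partition, and row position inside the part.
SELECT _part, _partition_id, _part_offset, *
FROM tab
ORDER BY _part, _part_offset
LIMIT 10;

-- Count rows per part.
SELECT _part, count() AS row_count
FROM tab
GROUP BY _part
ORDER BY _part;
```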

Column statistics

The statistics declaration is in the columns section of the CREATE query for tables from the *MergeTree* Family:

CREATE TABLE tab
(
    a Int64 STATISTICS(tdigest, uniq),
    b Float64
)
ENGINE = MergeTree
ORDER BY a

We can also manipulate statistics with ALTER statements:

ALTER TABLE tab ADD STATISTICS b TYPE tdigest, uniq;
ALTER TABLE tab DROP STATISTICS a;

These lightweight statistics aggregate information about the distribution of values in columns. Statistics are stored in every part and updated on every insert. They are used for PREWHERE optimization only if use_statistics = 1 is set.

Part Pruning with Statistics

When use_statistics_for_part_pruning is enabled, statistics can be used for part pruning. Currently, only MinMax and Basic statistics support part pruning. When such statistics are defined on a column, ClickHouse tracks the minimum and maximum values for that column in each part. Part pruning allows ClickHouse to skip reading entire data parts when the query filter condition cannot match any rows in that part.

Example:

-- Create a table with MinMax statistics on the 'value' column
CREATE TABLE test_stats
(
    id UInt64,
    value Int64 STATISTICS(minmax)
)
ENGINE = MergeTree
ORDER BY id;

SYSTEM STOP MERGES test_stats;

-- Insert data in separate inserts to create multiple parts
INSERT INTO test_stats SELECT number, number FROM numbers(1000); -- Part 1: value range [0, 999]
INSERT INTO test_stats SELECT number, number + 10000 FROM numbers(1000); -- Part 2: value range [10000, 10999]

SET use_statistics_for_part_pruning = 1;

-- This query will skip Part 1 entirely because its max value (999) < 5000
SELECT count() FROM test_stats WHERE value > 5000;

-- Use EXPLAIN to see the pruning effect
EXPLAIN indexes = 1 SELECT count() FROM test_stats WHERE value > 5000;
-- The output will show "Parts: 1/2" indicating one part was pruned

Available types of column statistics

basic

A compact bundle of single-value summaries derived from a column. Depending on the column type, the following pieces are populated:
- for any column whose values are represented by a number (integers, floats, Decimal*, Date*, DateTime*, Enum*, IPv4, ...): the minimum and maximum value, which allow estimating the selectivity of range filters and enable part pruning;
- for String and FixedString columns: the total byte length of non-NULL values (from which the average string length can be derived);
- for Nullable and LowCardinality(Nullable) columns: the count of NULL values, which the optimizer uses to discount NULL rows from selectivity estimates.
  
A single basic statistic can populate several of these at once. For example, on a Nullable(UInt32) column it tracks both the numeric min/max and the null count. Compared to minmax, basic additionally works on String / FixedString columns and can be declared on Nullable wrappers of types like UUID or IPv6 purely to track the null count.

minmax

The minimum and maximum column value, which allow estimating the selectivity of range filters on numeric columns.

tdigest

Note

Statistics of type tdigest have high creation costs and potentially slow down data ingest.

TDigest sketches which allow computing approximate percentiles (e.g. the 90th percentile) for numeric columns.

uniq

BJKST sketches which provide an estimate of how many distinct values a column contains. Internally uses uniq.

uniq_v2

Similar to uniq but internally uses uniqCombined(12) (a variant of HyperLogLog). Consumes less memory than uniq and can be built faster.

countmin

Note

Statistics of type countmin have high creation costs and potentially slow down data ingest.

CountMin sketches which provide an approximate count of the frequency of each value in a column.

Supported data types

Statistics type	(U)Int, Float, Decimal(), Date, Boolean, Enum*	IPv4	String or FixedString
basic	✔	✔	✔
countmin	✔	✔	✔
minmax	✔	✔	✗
tdigest	✔	✗	✗
uniq	✔	✔	✔
uniq_v2	✔	✔	✔

All of the above also accept Nullable and LowCardinality(Nullable) wrappers of the listed types. Basic may additionally be declared on Nullable wrappers of types like UUID or IPv6 purely to track the null count.

Supported operations

Statistics type	Equality filters (==)	Range filters (`>, >=, <, <=`)
basic	✗	✔ (numeric columns only)
countmin	✔	✗
minmax	✗	✔ (numeric columns only)
tdigest	✗	✔ (numeric columns only)
uniq	✔	✗
uniq_v2	✔	✗

For basic on String / FixedString columns the statistic only records the total non-NULL byte length (used to estimate average string length) and the null count; range filters and part pruning are not driven by it.

Column-level settings

Certain MergeTree settings can be overridden at column level:

max_compress_block_size — Maximum size of blocks of uncompressed data before compressing for writing to a table.
min_compress_block_size — Minimum size of blocks of uncompressed data required for compression when writing the next mark.

Example:

CREATE TABLE tab
(
    id Int64,
    document String SETTINGS (min_compress_block_size = 16777216, max_compress_block_size = 16777216)
)
ENGINE = MergeTree
ORDER BY id

Column-level settings can be modified or removed using ALTER MODIFY COLUMN, for example:

Remove SETTINGS from column declaration:

ALTER TABLE tab MODIFY COLUMN document REMOVE SETTINGS;

Modify a setting:

ALTER TABLE tab MODIFY COLUMN document MODIFY SETTING min_compress_block_size = 8192;

Reset one or more settings. This also removes the setting declaration from the column expression in the table's CREATE query:

ALTER TABLE tab MODIFY COLUMN document RESET SETTING min_compress_block_size;
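
To confirm the effect of any of these statements, the current column definition can be inspected. A minimal check, using the tab table from the example above:

```sql
-- The column-level SETTINGS clause appears in (or disappears from) the column definition.
SHOW CREATE TABLE tab;
```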
