Full-text search using text indexes
Text indexes in ClickHouse (also known as "inverted indexes") provide fast full-text search capabilities on string data. The index maps each token in the column to the rows that contain the token. The tokens are generated by a process called tokenization. For example, ClickHouse tokenizes the English sentence "All cats like mice." by default as ["All", "cats", "like", "mice"] (note that the trailing dot is ignored). More advanced tokenizers are available, for example for log data.
Creating a Text Index
To create a text index, first enable the corresponding experimental setting:
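In recent ClickHouse versions this is the allow_experimental_full_text_index setting; the exact setting name may differ between versions:

```sql
SET allow_experimental_full_text_index = 1;
```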
A text index can be defined on columns of type String, FixedString, Array(String), Array(FixedString), and Map (via the mapKeys and mapValues functions) using the following syntax:
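A minimal sketch of such a definition (table and column names are illustrative):

```sql
CREATE TABLE tab
(
    key UInt64,
    str String,
    INDEX text_idx(str) TYPE text(tokenizer = splitByNonAlpha)
)
ENGINE = MergeTree
ORDER BY key;
```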
Tokenizer argument (mandatory). The tokenizer argument specifies the tokenizer:
- splitByNonAlpha splits strings along non-alphanumeric ASCII characters (also see function splitByNonAlpha).
- splitByString(S) splits strings along certain user-defined separator strings S (also see function splitByString). The separators can be specified using an optional parameter, for example, tokenizer = splitByString([', ', '; ', '\n', '\\']). Note that each string can consist of multiple characters (', ' in the example). The default separator list, if not specified explicitly (for example, tokenizer = splitByString), is a single whitespace [' '].
- ngrams(N) splits strings into equally large N-grams (also see function ngrams). The ngram length can be specified using an optional integer parameter between 1 and 8, for example, tokenizer = ngrams(3). The default ngram size, if not specified explicitly (for example, tokenizer = ngrams), is 3.
- sparseGrams(min_length, max_length, min_cutoff_length) splits strings into variable-length n-grams of at least min_length and at most max_length (inclusive) characters (also see function sparseGrams). Unless specified explicitly, min_length and max_length default to 3 and 100. If parameter min_cutoff_length is provided, only n-grams with length greater than or equal to min_cutoff_length are returned. Compared to ngrams(N), the sparseGrams tokenizer produces variable-length N-grams, allowing for a more flexible representation of the original text. For example, tokenizer = sparseGrams(3, 5, 4) internally generates 3-, 4-, and 5-grams from the input string but only the 4- and 5-grams are returned.
- array performs no tokenization, i.e. every row value is a token (also see function array).
The splitByString tokenizer applies the split separators left-to-right.
This can create ambiguities.
For example, the separator strings ['%21', '%'] will cause %21abc to be tokenized as ['abc'], whereas switching the order of the separator strings to ['%', '%21'] will produce ['21abc'].
In most cases, you want matching to prefer longer separators first.
This can generally be done by passing the separator strings in order of descending length.
If the separator strings happen to form a prefix code, they can be passed in arbitrary order.
Building text indexes on text in non-Western languages, e.g. Chinese, is currently not recommended. The currently supported tokenizers may lead to huge index sizes and long query times. We plan to add specialized language-specific tokenizers in the future which will handle these cases better.
To test how the tokenizers split the input string, you can use ClickHouse's tokens function:
Example:
Result:
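As a sketch, calling tokens with its default tokenizer on the sentence from the introduction should behave roughly as follows (the output is shown as a comment):

```sql
SELECT tokens('All cats like mice.') AS result;

-- expected result:
-- ['All','cats','like','mice']
```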
Preprocessor argument (optional). The argument preprocessor is an expression which is applied to the input string before tokenization.
Typical use cases for the preprocessor argument include
- Lower- or upper-casing to enable case-insensitive matching, e.g., lower, lowerUTF8, see the first example below.
- UTF-8 normalization, e.g. normalizeUTF8NFC, normalizeUTF8NFD, normalizeUTF8NFKC, normalizeUTF8NFKD, toValidUTF8.
- Removing or transforming unwanted characters or substrings, e.g. extractTextFromHTML, substring, idnaEncode.
The preprocessor expression must transform an input value of type String or FixedString to a value of the same type.
Examples:
- INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(col))
- INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = substringIndex(col, '\n', 1))
- INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(extractTextFromHTML(col)))
Also, the preprocessor expression must only reference the column on top of which the text index is defined. Using non-deterministic functions is not allowed.
Functions hasToken, hasAllTokens and hasAnyTokens use the preprocessor to first transform the search term before tokenizing it.
For example,
is equivalent to:
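A sketch of that equivalence, assuming a hypothetical table tab with INDEX idx(col) TYPE text(tokenizer = splitByNonAlpha, preprocessor = lower(col)):

```sql
SELECT count() FROM tab WHERE hasToken(col, 'ClickHouse');

-- the search term is preprocessed (lowercased) before the index lookup,
-- i.e. the query behaves like:
SELECT count() FROM tab WHERE hasToken(col, lower('ClickHouse'));
```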
Other arguments (optional). Text indexes in ClickHouse are implemented as secondary (skipping) indexes. However, unlike other skipping indexes, text indexes have an infinite granularity by default, i.e. the text index is created for the entire part and an explicitly specified index granularity is ignored. This default has been chosen empirically and provides a good trade-off between speed and index size for most use cases. Advanced users can specify a different index granularity (we do not recommend this).
Optional advanced parameters
The default values of the following advanced parameters will work well in virtually all situations. We do not recommend changing them.
Optional parameter dictionary_block_size (default: 512) specifies the size of dictionary blocks in tokens.
Optional parameter dictionary_block_frontcoding_compression (default: 1) specifies if the dictionary blocks use front coding as compression.
Optional parameter posting_list_block_size (default: 1048576) specifies the size of posting list blocks in rows.
Text indexes can be added to or removed from a column after the table has been created:
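For example (index, table, and column names are illustrative):

```sql
-- add a text index to an existing column and build it for existing data
ALTER TABLE tab ADD INDEX text_idx(str) TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE tab MATERIALIZE INDEX text_idx;

-- remove the index again
ALTER TABLE tab DROP INDEX text_idx;
```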
Using a Text Index
Using a text index in SELECT queries is straightforward, as common string search functions will leverage the index automatically. If no index exists, the string search functions below fall back to slow brute-force scans.
Supported functions
The text index can be used if the functions below appear in the WHERE or PREWHERE clause:
= and !=
= (equals) and != (notEquals) match the entire given search term.
Example:
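A minimal sketch, assuming a column str indexed with the array tokenizer:

```sql
SELECT count() FROM tab WHERE str = 'Hello, ClickHouse!';
```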
The text index supports = and !=, but equality and inequality searches only make sense with the array tokenizer (which causes the index to store entire row values).
IN and NOT IN
IN (in) and NOT IN (notIn) are similar to functions equals and notEquals, but they match all (IN) or none (NOT IN) of the given search terms.
Example:
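A minimal sketch, again assuming a column str indexed with the array tokenizer:

```sql
SELECT count() FROM tab WHERE str IN ('Hello', 'World');
```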
The same restrictions as for = and != apply, i.e. IN and NOT IN only make sense in conjunction with the array tokenizer.
LIKE, NOT LIKE and match
These functions currently use the text index for filtering only if the index tokenizer is either splitByNonAlpha, ngrams or sparseGrams.
In order to use LIKE (like), NOT LIKE (notLike), and the match function with text indexes, ClickHouse must be able to extract complete tokens from the search term.
For an index with the ngrams tokenizer, this is the case if the searched substrings between special characters are at least as long as the ngram length.
Example for the text index with splitByNonAlpha tokenizer:
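For instance (the column name str is an assumption):

```sql
-- substring search: "support" cannot be extracted as a complete token from the pattern
SELECT count() FROM tab WHERE str LIKE '%support%';
```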
support in the example could match support, supports, supporting etc.
This kind of query is a substring query and it cannot be sped up by a text index.
To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way:
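For instance:

```sql
-- "support" is now guaranteed to be a complete token
SELECT count() FROM tab WHERE str LIKE '% support %';
```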
The spaces left and right of support make sure that the term can be extracted as a token.
startsWith and endsWith
Similar to LIKE, functions startsWith and endsWith can only use a text index if complete tokens can be extracted from the search term.
For a text index with the ngrams tokenizer, this is the case if the searched prefix or suffix is at least as long as the ngram length.
Example for the text index with splitByNonAlpha tokenizer:
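A sketch (the column name str is an assumption):

```sql
SELECT count() FROM tab WHERE startsWith(str, 'clickhouse support');
```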
In the example, only clickhouse is considered a token.
support is not a complete token because it could match support, supports, supporting etc.
To find all rows that start with clickhouse supports, please end the search pattern with a trailing space:
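For instance:

```sql
SELECT count() FROM tab WHERE startsWith(str, 'clickhouse supports ');
```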
Similarly, endsWith should be used with a leading space:
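For instance (the suffix is illustrative):

```sql
SELECT count() FROM tab WHERE endsWith(str, ' olap engine');
```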
hasToken and hasTokenOrNull
Functions hasToken and hasTokenOrNull match against a single given token.
Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token).
Example:
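A minimal sketch (table, column, and token are illustrative):

```sql
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse');
```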
Functions hasToken and hasTokenOrNull are the most performant functions to use with the text index.
hasAnyTokens and hasAllTokens
Functions hasAnyTokens and hasAllTokens match if at least one (hasAnyTokens) or all (hasAllTokens) of the given tokens are found.
These two functions accept the search tokens as either a string which will be tokenized using the same tokenizer used for the index column, or as an array of already processed tokens to which no tokenization will be applied prior to searching. See the function documentation for more info.
Example:
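A minimal sketch (table, column, and tokens are illustrative):

```sql
-- search terms passed as a string; it is tokenized with the index tokenizer
SELECT count() FROM tab WHERE hasAnyTokens(str, 'clickhouse olap');

-- search terms passed as an array of ready-made tokens; no further tokenization is applied
SELECT count() FROM tab WHERE hasAllTokens(str, ['clickhouse', 'olap']);
```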
has
Array function has matches against a single token in the array of strings.
Example:
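A minimal sketch, assuming a column arr of type Array(String) with a text index:

```sql
SELECT count() FROM tab WHERE has(arr, 'clickhouse');
```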
mapContains
Function mapContains (an alias of mapContainsKey) matches against tokens extracted from the searched string in the keys of a map. The behaviour is similar to the equals function with a String column. The text index is only used if it is created on the mapKeys(map) expression.
Example:
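A minimal sketch, assuming a column map of type Map(String, String) with a text index on mapKeys(map) (the key name is illustrative):

```sql
SELECT count() FROM tab WHERE mapContainsKey(map, 'error.code');
```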
mapContainsValue
Function mapContainsValue matches against tokens extracted from the searched string in the values of a map. The behaviour is similar to the equals function with a String column. The text index is only used if it is created on the mapValues(map) expression.
Example:
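A minimal sketch, assuming a text index on mapValues(map):

```sql
SELECT count() FROM tab WHERE mapContainsValue(map, 'clickhouse');
```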
mapContainsKeyLike and mapContainsValueLike
The functions mapContainsKeyLike and mapContainsValueLike match a pattern against all keys or values (respectively) of a map.
Example:
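A minimal sketch (the patterns are illustrative):

```sql
SELECT count() FROM tab WHERE mapContainsKeyLike(map, 'error%');
SELECT count() FROM tab WHERE mapContainsValueLike(map, '% timeout %');
```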
operator[]
The access operator [] can be used with the text index to filter by keys and values. The text index is only used if it is created on the mapKeys(map) or mapValues(map) expressions, or both.
Example:
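A minimal sketch, assuming text indexes on mapKeys(map) and mapValues(map) (key and value are illustrative):

```sql
SELECT count() FROM tab WHERE map['client'] = 'clickhouse';
```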
See the following examples for using columns of type Array(T) and Map(K, V) with the text index.
Examples for Array and Map columns with text indexes
Indexing Array(String) columns
Imagine a blogging platform where authors categorize their blog posts using keywords. We would like users to discover related content by searching for or clicking on topics.
Consider this table definition:
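A sketch of such a table (column names other than keywords are assumptions):

```sql
CREATE TABLE posts
(
    post_id UInt64,
    title String,
    content String,
    keywords Array(String)
)
ENGINE = MergeTree
ORDER BY post_id;
```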
Without a text index, finding posts with a specific keyword (e.g. clickhouse) requires scanning all entries:
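For instance:

```sql
SELECT post_id, title FROM posts WHERE has(keywords, 'clickhouse');
```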
As the platform grows, this becomes increasingly slow because the query must examine every keywords array in every row.
To overcome this performance issue, we define a text index for column keywords:
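A sketch of adding and building such an index (index name and tokenizer are illustrative):

```sql
ALTER TABLE posts ADD INDEX keywords_idx(keywords) TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE posts MATERIALIZE INDEX keywords_idx;  -- build the index for existing data
```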
Indexing Map columns
In many observability use cases, log messages are split into "components" and stored as appropriate data types, e.g. a date-time for the timestamp, an enum for the log level, etc. Metrics fields are best stored as key-value pairs. Operations teams need to efficiently search through logs for debugging, security incidents, and monitoring.
Consider this logs table:
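A sketch of such a table (column names and enum values are assumptions):

```sql
CREATE TABLE logs
(
    timestamp DateTime,
    level Enum8('debug' = 0, 'info' = 1, 'warn' = 2, 'error' = 3),
    message String,
    attributes Map(String, String)
)
ENGINE = MergeTree
ORDER BY timestamp;
```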
Without a text index, searching through Map data requires full table scans:
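For instance (the attribute value is illustrative):

```sql
SELECT count() FROM logs WHERE mapContainsValue(attributes, 'timeout');
```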
As log volume grows, these queries become slow.
The solution is creating a text index for the Map keys and values. Use mapKeys to create a text index when you need to find logs by field names or attribute types:
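A sketch (index name and tokenizer are illustrative):

```sql
ALTER TABLE logs ADD INDEX attr_keys_idx(mapKeys(attributes)) TYPE text(tokenizer = splitByNonAlpha);
```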
Use mapValues to create a text index when you need to search within the actual content of attributes:
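A sketch:

```sql
ALTER TABLE logs ADD INDEX attr_values_idx(mapValues(attributes)) TYPE text(tokenizer = splitByNonAlpha);
```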
Example queries:
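The following sketches illustrate the query patterns (keys and values are assumptions):

```sql
-- find log rows that carry a specific attribute key (uses the mapKeys index)
SELECT count() FROM logs WHERE mapContainsKey(attributes, 'error.code');

-- find log rows whose attribute values contain a specific token (uses the mapValues index)
SELECT count() FROM logs WHERE mapContainsValue(attributes, 'connection_refused');

-- filter on the value of a specific key (can use both indexes)
SELECT count() FROM logs WHERE attributes['service'] = 'auth';
```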
Performance Tuning
Direct read
Certain types of text queries can be sped up significantly by an optimization called "direct read".
Example:
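For instance, a count query whose only filter is a supported text search function (table, column, and tokens are illustrative):

```sql
SELECT count() FROM tab WHERE hasAnyTokens(str, ['clickhouse', 'olap']);
```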
The direct read optimization in ClickHouse answers the query exclusively from the text index (i.e., using text index lookups) without accessing the underlying text column. Text index lookups read relatively little data and are therefore much faster than the usual skip index flow in ClickHouse (a skip index lookup, followed by loading and filtering the remaining granules).
Direct read is controlled by two settings:
- Setting query_plan_direct_read_from_text_index which specifies if direct read is generally enabled.
- Setting use_skip_indexes_on_data_read which is another prerequisite for direct read. Note that on ClickHouse databases with compatibility < 25.10, use_skip_indexes_on_data_read is disabled, so you either need to raise the compatibility setting value or SET use_skip_indexes_on_data_read = 1 explicitly.
Also, the text index must be fully materialized to use direct reading (use ALTER TABLE ... MATERIALIZE INDEX for that).
Supported functions
The direct read optimization supports functions hasToken, hasAllTokens, and hasAnyTokens.
If the text index is created with the array tokenizer, direct read is also supported for functions equals, has, mapContainsKey, and mapContainsValue.
These functions can also be combined using the AND, OR, and NOT operators.
The WHERE or PREWHERE clause can also contain additional filters that are not text search functions (on text columns or other columns). In that case, the direct read optimization will still be used, but it is less effective (it only applies to the supported text search functions).
To check whether a query uses direct read, run the query with EXPLAIN PLAN actions = 1.
As an example, a query with disabled direct read
returns
whereas the same query run with query_plan_direct_read_from_text_index = 1
returns
The second EXPLAIN PLAN output contains a virtual column __text_index_<index_name>_<function_name>_<id>.
If this column is present, then direct read is used.
The performance benefit of the direct read optimization is greatest when the text column is used exclusively within text search functions, as this allows the query to avoid reading the column data entirely. However, even if the text column is accessed elsewhere in the query and must be read, the direct read optimization still provides a performance improvement.
Direct read as a hint
Direct read as a hint uses the same principles as normal direct read, but instead adds an additional filter built from the text index data without removing the underlying text column. It is used for functions for which accessing only the text index may produce false positives.
Supported functions are: like, startsWith, endsWith, equals, has, mapContainsKey, and mapContainsValue.
A hint filter can provide additional selectivity to further restrict the result set in combination with other filters, helping to reduce the amount of data read from other columns.
Direct read as a hint is controlled by setting query_plan_text_index_add_hint (enabled by default).
Example of query without hint:
returns
whereas the same query run with query_plan_text_index_add_hint = 1
returns
In the second EXPLAIN PLAN output, you can see that an additional conjunct (__text_index_...) has been added to the filter condition. Thanks to the PREWHERE optimization, the filter condition is broken down into three separate conjuncts, which are applied in order of increasing computational complexity. For this query, the application order is __text_index_..., then greaterOrEquals(...), and finally like(...). This ordering allows even more data granules to be skipped than the text index and the original filter alone would skip before the heavy columns referenced after the WHERE clause are read, further reducing the amount of data to read.
Caching
Different caches are available to buffer parts of the text index in memory (see section Implementation Details). Currently, there are caches for the deserialized dictionary blocks, headers, and posting lists of the text index to reduce I/O. They can be enabled via settings use_text_index_dictionary_cache, use_text_index_header_cache, and use_text_index_postings_cache. By default, all caches are disabled. To drop the caches, use the statement SYSTEM DROP TEXT INDEX CACHES.
Please refer to the following server settings to configure the caches.
Dictionary blocks cache settings
| Setting | Description |
|---|---|
| text_index_dictionary_block_cache_policy | Text index dictionary block cache policy name. |
| text_index_dictionary_block_cache_size | Maximum cache size in bytes. |
| text_index_dictionary_block_cache_max_entries | Maximum number of deserialized dictionary blocks in cache. |
| text_index_dictionary_block_cache_size_ratio | The size of the protected queue in the text index dictionary block cache relative to the cache's total size. |
Header cache settings
| Setting | Description |
|---|---|
| text_index_header_cache_policy | Text index header cache policy name. |
| text_index_header_cache_size | Maximum cache size in bytes. |
| text_index_header_cache_max_entries | Maximum number of deserialized headers in cache. |
| text_index_header_cache_size_ratio | The size of the protected queue in the text index header cache relative to the cache's total size. |
Posting lists cache settings
| Setting | Description |
|---|---|
| text_index_postings_cache_policy | Text index postings cache policy name. |
| text_index_postings_cache_size | Maximum cache size in bytes. |
| text_index_postings_cache_max_entries | Maximum number of deserialized postings in cache. |
| text_index_postings_cache_size_ratio | The size of the protected queue in the text index postings cache relative to the cache's total size. |
Implementation Details
Each text index consists of two (abstract) data structures:
- a dictionary which maps each token to a postings list, and
- a set of postings lists, each representing a set of row numbers.
The text index is built for the whole part. Unlike other skip indexes, a text index can be merged instead of rebuilt when data parts are merged.
During index creation, three files are created (per part):
Dictionary blocks file (.dct)
The tokens in the text index are sorted and stored in dictionary blocks of 512 tokens each (the block size is configurable by parameter dictionary_block_size).
A dictionary blocks file (.dct) contains all the dictionary blocks of all index granules in a part.
Index header file (.idx)
The index header file contains for each dictionary block the block's first token and its relative offset in the dictionary blocks file.
This sparse index structure is similar to ClickHouse's sparse primary key index.
Postings lists file (.pst)
The posting lists for all tokens are laid out sequentially in the postings list file.
To save space while still allowing fast intersection and union operations, the posting lists are stored as roaring bitmaps.
If a posting list is larger than posting_list_block_size, it is split into multiple blocks that are stored sequentially in the postings lists file.
Merging of text indexes
When data parts are merged, the text index does not need to be rebuilt from scratch; instead, it can be merged efficiently in a separate step of the merge process. During this step, the sorted dictionaries from each part are read and combined into a new unified dictionary. The row numbers in the postings lists are also recalculated to reflect their new positions in the merged data part, using a mapping of old to new row numbers that is created during the initial merge phase. This method of merging text indexes is similar to how projections with a _part_offset column are merged. If the index is not materialized in a source part, it is built, written into a temporary file, and then merged together with the indexes from the other parts and from other temporary index files.
Example: Hackernews dataset
Let's look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments from the popular Hacker News website. Here is the table without a text index:
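A minimal sketch of such a table (only the comment column is used below; the other columns are assumptions):

```sql
CREATE TABLE hackernews
(
    id UInt64,
    author String,
    timestamp DateTime,
    comment String
)
ENGINE = MergeTree
ORDER BY id;
```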
The 28.7M rows are in a Parquet file in S3 - let's insert them into the hackernews table:
We will use ALTER TABLE to add a text index on the comment column, then materialize it:
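A sketch (index name and tokenizer choice are illustrative):

```sql
ALTER TABLE hackernews ADD INDEX comment_idx(comment) TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE hackernews MATERIALIZE INDEX comment_idx;
```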
Now, let's run queries using hasToken, hasAnyTokens, and hasAllTokens functions.
The following examples will show the dramatic performance difference between a standard index scan and the direct read optimization.
1. Using hasToken
hasToken checks if the text contains a specific single token.
We'll search for the case-sensitive token 'ClickHouse'.
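The query might look as follows; the two runs below differ only in whether direct read is switched off via the setting:

```sql
-- standard scan: direct read disabled explicitly
SELECT count() FROM hackernews WHERE hasToken(comment, 'ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0;

-- fast index read: direct read enabled (the default)
SELECT count() FROM hackernews WHERE hasToken(comment, 'ClickHouse');
```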
Direct read disabled (Standard scan)
By default, ClickHouse uses the skip index to filter granules and then reads the column data for those granules. We can simulate this behavior by disabling direct read.
Direct read enabled (Fast index read)
Now we run the same query with direct read enabled (the default).
The direct read query is over 45 times faster (0.362s vs 0.008s) and processes significantly less data (9.51 GB vs 3.15 MB) by reading from the index alone.
2. Using hasAnyTokens
hasAnyTokens checks if the text contains at least one of the given tokens.
We'll search for comments containing either 'love' or 'ClickHouse'.
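For instance (toggle query_plan_direct_read_from_text_index as in the previous example to compare both runs):

```sql
SELECT count() FROM hackernews WHERE hasAnyTokens(comment, ['love', 'ClickHouse']);
```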
Direct read disabled (Standard scan)
Direct read enabled (Fast index read)
The speedup is even more dramatic for this common "OR" search. The query is nearly 89 times faster (1.329s vs 0.015s) by avoiding the full column scan.
3. Using hasAllTokens
hasAllTokens checks if the text contains all of the given tokens.
We'll search for comments containing both 'love' and 'ClickHouse'.
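For instance:

```sql
SELECT count() FROM hackernews WHERE hasAllTokens(comment, ['love', 'ClickHouse']);
```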
Direct read disabled (Standard scan)
Even with direct read disabled, the standard skip index is still effective. It filters down the 28.7M rows to just 147.46K rows, but it still must read 57.03 MB from the column.
Direct read enabled (Fast index read)
Direct read answers the query by operating on the index data, reading only 147.46 KB.
For this "AND" search, the direct read optimization is over 26 times faster (0.184s vs 0.007s) than the standard skip index scan.
4. Compound search: OR, AND, NOT, ...
The direct read optimization also applies to compound boolean expressions. Here, we'll perform a case-insensitive search for 'ClickHouse' OR 'clickhouse'.
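For instance:

```sql
SELECT count() FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse');
```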
Direct read disabled (Standard scan)
Direct read enabled (Fast index read)
By combining the results from the index, the direct read query is 34 times faster (0.450s vs 0.013s) and avoids reading the 9.58 GB of column data.
For this specific case, hasAnyTokens(comment, ['ClickHouse', 'clickhouse']) would be the preferred, more efficient syntax.