Intro to Full-Text Search in ClickHouse

Mark Needham

ClickHouse 26.2 brings the text index to production status, and in this video we take a proper look at how it works. It's implemented as an inverted index - similar to what you'd find in Lucene - so it stores a mapping from tokens to row numbers, letting the query engine skip most of the data when searching text columns.

We run through the whole thing using a 20 million row subset of the GitHub events dataset:

  • How the text index works under the hood (inverted index, tokenizers, preprocessors)
  • Creating tables with and without the index and comparing insert performance and disk usage
  • Writing queries using hasToken, hasAllTokens, and hasAnyTokens
  • Using EXPLAIN to see granule skipping in action — from 2504 granules down to 28
  • A real-world example: GitTrends, built on the full 10 billion row dataset