Late materialization of secondary indices

Mark Needham

Ever wondered how ClickHouse handles the creation of secondary indexes during data ingestion? In ClickHouse 25.10, a new feature called late materialization of secondary indices is introduced, providing more control over when indexes are built.

This can be a game-changer when working with computationally expensive indexes, such as vector similarity indexes, where you may want to load your data first and build the index later, rather than incurring the performance cost upfront.

In this tutorial, we use Wikipedia embeddings and vector similarity indexes to demonstrate the performance difference between immediate and late materialization. You'll see how inserting 150,000 rows takes 26 seconds with immediate index creation versus just 2 seconds when deferring it, and how query performance compares before and after materialization.

  • Setting up vector similarity indexes with HNSW algorithm and L2 distance
  • Using exclude_materialize_skip_indexes_on_insert to defer index creation during data loading
  • Comparing query performance with and without materialized indexes (8x speedup demonstrated)
  • Controlling index materialization behavior during merges with configuration settings
  • Manually triggering index materialization with ALTER TABLE commands

Parquet files with embeddings - https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M

Recent videos

YouTube Video: 8Avld0BlDlg

Open House, User stories

How Capital One cut infrastructure costs by 50%

Engineering leaders at Capital One share how they cut infrastructure costs by 50% and reduced average dashboard load time from 5+ to under 500ms with ClickHouse Cloud.

Follow us

XBlueskySlackGithubTelegramMeetupRSS