Late materialization of secondary indices
Mark Needham
Ever wondered how ClickHouse handles the creation of secondary indexes during data ingestion? In ClickHouse 25.10, a new feature called late materialization of secondary indices is introduced, providing more control over when indexes are built.
This can be a game-changer when working with computationally expensive indexes, such as vector similarity indexes, where you may want to load your data first and build the index later, rather than incurring the performance cost upfront.
In this tutorial, we use Wikipedia embeddings and vector similarity indexes to demonstrate the performance difference between immediate and late materialization. You'll see how inserting 150,000 rows takes 26 seconds with immediate index creation versus just 2 seconds when deferring it, and how query performance compares before and after materialization.
- Setting up vector similarity indexes with HNSW algorithm and L2 distance
- Using
exclude_materialize_skip_indexes_on_insertto defer index creation during data loading - Comparing query performance with and without materialized indexes (8x speedup demonstrated)
- Controlling index materialization behavior during merges with configuration settings
- Manually triggering index materialization with ALTER TABLE commands
Parquet files with embeddings - https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M

Scaling ClickHouse to petabytes of logs at OpenAI

How ClickHouse helps Anthropic scale observability

How Capital One Slingshot cut infrastructure costs by 50%
Engineering leaders at Capital One Slingshot share how they cut infrastructure costs by 50% and reduced average dashboard load time from 5+ to under 500ms with ClickHouse Cloud.