Columnar databases beat row stores by 10× to 1000× on analytical queries because two principles compound. The first is efficient execution: storage layout and vectorised CPU loops do each unit of work as fast as possible. The second is smart pruning: data skipping and late materialisation cut the amount of work the query has to do at all. ClickHouse, DuckDB, and Snowflake all stack both.
TL;DR #
- Columnar storage reads only the columns a query references, often cutting raw I/O by 90%+ on wide tables.
- Vectorised query execution runs operators on batches of values, amortising interpreter overhead and unlocking SIMD.
- Data skipping (zone maps, Bloom filters, sparse indexes) prunes whole blocks before the engine ever decompresses them.
- Late materialisation defers row assembly until after filters run, cutting downstream column work by 10-1000× on selective queries.
- Each mechanism multiplies the others.
Why are columnar databases faster than row stores for analytics? #
Columnar databases are faster because they read fewer bytes per query, process more values per cycle, and skip data before touching it. A row store walks every byte of every scanned row. A column store reads only referenced columns, processes them in dense batches, and uses precomputed metadata to prune blocks before opening them.
Each mechanism contributes significantly on its own; combined, they multiply. A query that reads 3 of 50 columns saves ~94% of I/O, prunes 90% of the remaining blocks via skipping indexes, and processes what survives 10× faster per cycle. Stack the multipliers and modest queries land 100-1000× ahead of a row store on the same hardware.
The mechanisms split into two categories. Efficiency covers how each unit of work runs faster: columnar storage and vectorised execution. Smart pruning covers how the engine avoids work entirely: data skipping indexes and late materialisation.
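To see why stacking matters, here is a back-of-envelope sketch of the compounding, using the illustrative ratios above. Real queries land below the naive product because decompression, memory bandwidth, and fixed costs eat into it:

```python
# Back-of-envelope compounding of the three multipliers.
# Ratios are the illustrative ones from the text, not measurements.

io_saving = 50 / 3           # read 3 of 50 columns -> ~16.7x less I/O
pruning = 1 / (1 - 0.90)     # skip 90% of blocks -> 10x fewer to read
cpu_speedup = 10             # vectorised loops vs row-at-a-time dispatch

naive_product = io_saving * pruning * cpu_speedup
print(f"naive combined speedup ~ {naive_product:.0f}x")   # ~ 1667x
```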
How does columnar storage make queries faster? #
Columnar storage packs same-type values from one column into contiguous blocks on disk. A query that touches 3 of 50 columns reads ~6% of the bytes a row store reads for the same rows. The savings extend through the page cache and CPU cache: a sequential read fills cache lines with bytes the query actually uses, instead of mixed columns the query will throw away.
The layout also enables much higher compression ratios. Adjacent same-type values dictionary-encode, run-length-encode, or delta-encode 5-10× tighter than mixed rows. A low-cardinality column can hit 30×. Read 8% of the columns at 8× compression and you have cut raw bytes by ~99%.
| Layer | Row store | Column store |
|---|---|---|
| Disk read | Every byte of every scanned row | Only referenced columns |
| Cache locality | Tuples scatter across cache lines | One column fills cache lines tightly |
| Compression | 1.5-3× typical | 5-10×, up to 30× on low cardinality |
The full encoding stack (dictionary, RLE, delta, Gorilla, plus a general codec like LZ4 or ZSTD) lives in how columnar compression works. The on-disk shape itself is covered in what is columnar storage.
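As a concrete illustration of why same-type adjacency compresses so well, here is a minimal dictionary-plus-RLE sketch in Python. A real engine would layer a general codec like LZ4 or ZSTD on top, and the data here is invented:

```python
from itertools import groupby

# A low-cardinality column, as a columnar engine sees it:
# one contiguous array of same-typed values.
column = ["US", "US", "US", "DE", "DE", "US", "US", "US", "US", "FR"]

# Dictionary encoding: store each distinct value once,
# replace the column with small integer codes.
dictionary = {v: i for i, v in enumerate(dict.fromkeys(column))}
codes = [dictionary[v] for v in column]          # [0,0,0,1,1,0,0,0,0,2]

# Run-length encoding on the codes: (code, run_length) pairs.
rle = [(code, len(list(run))) for code, run in groupby(codes)]
print(dictionary)   # {'US': 0, 'DE': 1, 'FR': 2}
print(rle)          # [(0, 3), (1, 2), (0, 4), (2, 1)]

# Ten strings collapse to a 3-entry dictionary plus four (code, count)
# pairs. The adjacency a row store destroys is what makes this work.
```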
How does vectorised execution make queries faster? #
Vectorised execution processes batches of 1024-4096 values per operator call instead of one row at a time. This amortises interpreter dispatch, keeps tight working sets in CPU cache, and lets the compiler emit SIMD instructions that handle 8-16 values per CPU cycle. The MonetDB/X100 paper (Boncz et al., CIDR 2005) showed that swapping single-row iteration for batched columnar loops eliminated most of the dispatch cost that dominates row-at-a-time engines.
The columnar layout is what makes this practical. Values of one column are already contiguous and identically typed, so a tight loop loads 256-bit or 512-bit chunks straight into a register without unpacking. ClickHouse, DuckDB, Snowflake, Databricks Photon, DataFusion, and Velox all ship variants of this model. Implementation details, vector sizes, and SIMD dispatch strategies are covered in vectorised query execution.
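A toy NumPy contrast between the two execution styles. NumPy's C loops stand in for an engine's SIMD kernels here; the sizes and the operator are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.uniform(0, 100, 1_000_000)

# Row-at-a-time: one interpreter dispatch per value.
def scalar_sum_over_50(values):
    total = 0.0
    for v in values:          # a million per-value dispatches
        if v > 50.0:
            total += v
    return total

# Vectorised: one call processes the whole batch. NumPy's C loops
# over contiguous same-typed memory let the compiler emit SIMD.
def vector_sum_over_50(values):
    return values[values > 50.0].sum()

# Same answer; the vectorised form is typically 1-2 orders of
# magnitude faster, for the same reason batched columnar operators
# beat tuple-at-a-time iterators.
assert np.isclose(scalar_sum_over_50(prices), vector_sum_over_50(prices))
```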
How do columnar engines avoid unnecessary work? #
Two techniques let the engine skip work entirely. Data skipping indexes store precomputed metadata per block (min/max, Bloom filters, sparse marks) so the planner can prune blocks before reading them. Late materialisation defers row assembly until after filters run, so the engine only fetches non-predicate columns for surviving rows.
| Pruning mechanism | What it stores | Best for |
|---|---|---|
| Min/max (zone map) | Min and max per block | Range filters on sorted or correlated columns |
| Sparse primary index | One mark every N rows | Range filters on the primary key |
| Bloom filter | Probabilistic set membership | Equality on high-cardinality columns |
| Set index | Distinct values per block | Low-cardinality equality filters |
A time-series table sorted by (tenant_id, timestamp) will skip 99%+ of blocks for a typical recent-activity query. Late materialisation then takes the surviving row positions and only fetches the SELECT columns for those. On a 1-in-1000 filter, that cuts downstream column work by ~1000×. The two mechanisms are independent: pruning reduces the bytes the engine reads, materialisation reduces the bytes it processes after reading.
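A minimal sketch of both pruning steps over in-memory NumPy blocks. The block size, schema, and selectivity are invented for illustration:

```python
import numpy as np

BLOCK = 4096
rng = np.random.default_rng(0)

# A "table" sorted by timestamp, stored column-by-column.
n = 1_000_000
timestamp = np.sort(rng.integers(0, 1_000_000_000, n))
payload = rng.uniform(size=n)             # a non-predicate column

# Zone map: precomputed min/max per block of the predicate column.
starts = np.arange(0, n, BLOCK)
zone_map = [(timestamp[s:s + BLOCK].min(), timestamp[s:s + BLOCK].max())
            for s in starts]

lo, hi = 990_000_000, 1_000_000_000       # a "recent activity" filter

# Step 1 -- data skipping: open only blocks whose [min, max] overlaps.
candidates = [s for s, (mn, mx) in zip(starts, zone_map)
              if mx >= lo and mn <= hi]

# Step 2 -- late materialisation: evaluate the filter on the predicate
# column first, then fetch the payload column only for survivors.
hits = []
for s in candidates:
    ts = timestamp[s:s + BLOCK]
    mask = (ts >= lo) & (ts <= hi)            # predicate column only
    hits.append(payload[s:s + BLOCK][mask])   # payload for survivors

result = np.concatenate(hits) if hits else np.array([])
print(f"opened {len(candidates)} of {len(starts)} blocks, "
      f"materialised {result.size} of {n} payload values")
```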
How does ClickHouse implement these techniques? #
ClickHouse's official "why is ClickHouse so fast?" guide and academic overview give the long-form answer. The short version: ClickHouse stacks every mechanism above and tunes each one aggressively.
- MergeTree storage. Data is written as immutable parts, each split into granules of ~8192 rows. Each granule has a sparse primary-key mark, so range scans on a sorted column can skip almost every block (sketched in code after this list).
- PREWHERE. An explicit late-materialisation hint. ClickHouse reads the predicate column, evaluates the filter, then fetches the SELECT columns only for surviving rows. Modern versions move predicates into PREWHERE automatically when selectivity warrants.
- Vectorised execution with CPU dispatch. Hot loops are compiled multiple times (SSE4.2, AVX2, AVX-512) and the binary picks the variant at startup based on cpuid. See the CPU dispatch blog.
- Per-column codecs. Columns can specify their own codec stack (Delta + LZ4 for timestamps, Gorilla for floats), trading CPU for size on the data shape that suits each column.
- Skip indexes. Beyond the sparse primary index, secondary skip indexes (min/max, set, Bloom filter, n-gram) prune blocks for non-primary columns.
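To make the MergeTree bullet concrete, here is a minimal sketch of a sparse primary index: one mark per granule, binary-searched to bound the granules a key-range filter can touch. The granule size matches ClickHouse's default; everything else is simplified:

```python
import bisect

GRANULE = 8192   # ClickHouse's default index_granularity

# A part's primary-key column, sorted on disk.
keys = list(range(0, 50_000_000, 3))      # ~16.7M sorted rows

# Sparse index: one mark (the first key) per granule, not per row.
marks = keys[::GRANULE]                   # ~2000 marks for 16.7M rows

def granule_range(lo, hi):
    """Granules that may contain keys in [lo, hi]."""
    first = max(bisect.bisect_right(marks, lo) - 1, 0)
    last = bisect.bisect_right(marks, hi)       # exclusive
    return first, last

first, last = granule_range(1_000_000, 1_010_000)
print(f"scan granules {first}..{last - 1} of {len(marks)}; "
      f"skip the other {len(marks) - (last - first)}")
```

Two binary searches over ~2000 marks replace a scan of ~16.7M rows; only the two overlapping granules are ever read from disk.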
When are columnar databases slower than row stores? #
The same layout that wins on wide aggregations is what makes single-row work expensive: every column file must be opened to reconstruct one row. Columnar engines lose to row stores on point lookups, per-row mutations, and high-concurrency transactional writes.
For mixed workloads, a hybrid pattern (Postgres for OLTP plus ClickHouse for analytics) is usually the right shape rather than forcing one layout to do both jobs. Cross-engine numbers, including the cases where row stores outperform column stores, live in the row vs column benchmark.