Columnar databases beat row stores by 10× to 1000× on analytical queries because two principles compound. The first is efficient execution: storage layout and vectorised CPU loops do each unit of work as fast as possible. The second is smart pruning: data skipping and late materialisation cut the amount of work the query has to do at all. ClickHouse, DuckDB, and Snowflake all stack both.
TL;DR #
- Columnar storage reads only the columns a query references, often cutting raw I/O by 90%+ on wide tables.
- Vectorised query execution runs operators on batches of values, amortising interpreter overhead and unlocking SIMD.
- Data skipping (zone maps, Bloom filters, sparse indexes) prunes whole blocks before the engine ever decompresses them.
- Late materialisation defers row assembly until after filters run, cutting downstream column work by 10-1000× on selective queries.
- Each mechanism multiplies the others.
Why are columnar databases faster than row stores for analytics? #
Columnar databases are faster because they read fewer bytes per query, process more values per cycle, and skip data before touching it. A row store walks every byte of every scanned row. A column store reads only referenced columns, processes them in dense batches, and uses precomputed metadata to prune blocks before opening them.
Each mechanism contributes significantly on its own; combined, they multiply. A query that reads 3 of 50 columns saves ~94% of I/O, prunes 90% of the remaining blocks via skipping indexes, and processes what survives 10× faster per cycle. Stack the multipliers and modest queries land 100-1000× ahead of a row store on the same hardware.
The mechanisms split into two categories. Efficiency covers how each unit of work runs faster: columnar storage and vectorised execution. Smart pruning covers how the engine avoids work entirely: data skipping indexes and late materialisation.
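To see why stacking matters, here is a back-of-envelope sketch of the compounding, using the illustrative ratios above. Real queries land below the naive product because decompression, memory bandwidth, and fixed costs eat into it:

```python
# Back-of-envelope compounding of the three multipliers.
# Ratios are the illustrative ones from the text, not measurements.

io_saving = 50 / 3           # read 3 of 50 columns -> ~16.7x less I/O
pruning = 1 / (1 - 0.90)     # skip 90% of blocks -> 10x fewer to read
cpu_speedup = 10             # vectorised loops vs row-at-a-time dispatch

naive_product = io_saving * pruning * cpu_speedup
print(f"naive combined speedup ~ {naive_product:.0f}x")   # ~ 1667x
```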
How does columnar storage make queries faster? #
Columnar storage packs same-type values from one column into contiguous blocks on disk. A query that touches 3 of 50 columns reads ~6% of the bytes a row store reads for the same rows. The savings extend through the page cache and CPU cache: a sequential read fills cache lines with bytes the query actually uses, instead of mixed columns the query will throw away.
The layout also enables much higher compression ratios. Adjacent same-type values dictionary-encode, run-length-encode, or delta-encode 5-10× tighter than mixed rows. A low-cardinality column can hit 30×. Read 8% of the columns at 8× compression and you have cut raw bytes by ~99%.
| Layer | Row store | Column store |
|---|---|---|
| Disk read | Every byte of every scanned row | Only referenced columns |
| Cache locality | Tuples scatter across cache lines | One column fills cache lines tightly |
| Compression | 1.5-3× typical | 5-10×, up to 30× on low cardinality |
The full encoding stack (dictionary, RLE, delta, Gorilla, plus a general codec like LZ4 or ZSTD) lives in how columnar compression works. The on-disk shape itself is covered in what is columnar storage.
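As a concrete illustration of why same-type adjacency compresses so well, here is a minimal dictionary-plus-RLE sketch in Python. A real engine would layer a general codec like LZ4 or ZSTD on top, and the data here is invented:

```python
from itertools import groupby

# A low-cardinality column, as a columnar engine sees it:
# one contiguous array of same-typed values.
column = ["US", "US", "US", "DE", "DE", "US", "US", "US", "US", "FR"]

# Dictionary encoding: store each distinct value once,
# replace the column with small integer codes.
dictionary = {v: i for i, v in enumerate(dict.fromkeys(column))}
codes = [dictionary[v] for v in column]          # [0,0,0,1,1,0,0,0,0,2]

# Run-length encoding on the codes: (code, run_length) pairs.
rle = [(code, len(list(run))) for code, run in groupby(codes)]
print(dictionary)   # {'US': 0, 'DE': 1, 'FR': 2}
print(rle)          # [(0, 3), (1, 2), (0, 4), (2, 1)]

# Ten strings collapse to a 3-entry dictionary plus four (code, count)
# pairs. The adjacency a row store destroys is what makes this work.
```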
How does vectorised execution make queries faster? #
Vectorised execution processes batches of 1024-4096 values per operator call instead of one row at a time. This amortises interpreter dispatch, keeps tight working sets in CPU cache, and lets the compiler emit SIMD instructions that handle 8-16 values per CPU cycle. The MonetDB/X100 paper (Boncz et al., CIDR 2005) showed that swapping single-row iteration for batched columnar loops eliminated most of the dispatch cost that dominates row-at-a-time engines.
The columnar layout is what makes this practical. Values of one column are already contiguous and identically typed, so a tight loop loads 256-bit or 512-bit chunks straight into a register without unpacking. ClickHouse, DuckDB, Snowflake, Databricks Photon, DataFusion, and Velox all ship variants of this model. Implementation details, vector sizes, and SIMD dispatch strategies are covered in vectorised query execution.
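A toy NumPy contrast between the two execution styles. NumPy's C loops stand in for an engine's SIMD kernels here; the sizes and the operator are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.uniform(0, 100, 1_000_000)

# Row-at-a-time: one interpreter dispatch per value.
def scalar_sum_over_50(values):
    total = 0.0
    for v in values:          # a million per-value dispatches
        if v > 50.0:
            total += v
    return total

# Vectorised: one call processes the whole batch. NumPy's C loops
# over contiguous same-typed memory let the compiler emit SIMD.
def vector_sum_over_50(values):
    return values[values > 50.0].sum()

# Same answer; the vectorised form is typically 1-2 orders of
# magnitude faster, for the same reason batched columnar operators
# beat tuple-at-a-time iterators.
assert np.isclose(scalar_sum_over_50(prices), vector_sum_over_50(prices))
```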
How do columnar engines avoid unnecessary work? #
Two techniques let the engine skip work entirely. Data skipping indexes store precomputed metadata per block (min/max, Bloom filters, sparse marks) so the planner can prune blocks before reading them. Late materialisation defers row assembly until after filters run, so the engine only fetches non-predicate columns for surviving rows.
| Pruning mechanism | What it stores | Best for |
|---|---|---|
| Min/max (zone map) | Min and max per block | Range filters on sorted or correlated columns |
| Sparse primary index | One mark every N rows | Range filters on the primary key |
| Bloom filter | Probabilistic set membership | Equality on high-cardinality columns |
| Set index | Distinct values per block | Low-cardinality equality filters |
A time-series table sorted by (tenant_id, timestamp) will skip 99%+ of blocks for a typical recent-activity query. Late materialisation then takes the surviving row positions and only fetches the SELECT columns for those. On a 1-in-1000 filter, that cuts downstream column work by ~1000×. The two mechanisms are independent: pruning reduces the bytes the engine reads, materialisation reduces the bytes it processes after reading.
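A minimal sketch of both pruning steps over in-memory NumPy blocks. The block size, schema, and selectivity are invented for illustration:

```python
import numpy as np

BLOCK = 4096
rng = np.random.default_rng(0)

# A "table" sorted by timestamp, stored column-by-column.
n = 1_000_000
timestamp = np.sort(rng.integers(0, 1_000_000_000, n))
payload = rng.uniform(size=n)             # a non-predicate column

# Zone map: precomputed min/max per block of the predicate column.
starts = np.arange(0, n, BLOCK)
zone_map = [(timestamp[s:s + BLOCK].min(), timestamp[s:s + BLOCK].max())
            for s in starts]

lo, hi = 990_000_000, 1_000_000_000       # a "recent activity" filter

# Step 1 -- data skipping: open only blocks whose [min, max] overlaps.
candidates = [s for s, (mn, mx) in zip(starts, zone_map)
              if mx >= lo and mn <= hi]

# Step 2 -- late materialisation: evaluate the filter on the predicate
# column first, then fetch the payload column only for survivors.
hits = []
for s in candidates:
    ts = timestamp[s:s + BLOCK]
    mask = (ts >= lo) & (ts <= hi)            # predicate column only
    hits.append(payload[s:s + BLOCK][mask])   # payload for survivors

result = np.concatenate(hits) if hits else np.array([])
print(f"opened {len(candidates)} of {len(starts)} blocks, "
      f"materialised {result.size} of {n} payload values")
```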
How does ClickHouse implement these techniques? #
ClickHouse's official "why is ClickHouse so fast?" guide and academic overview give the long-form answer. The short version: ClickHouse stacks every mechanism above and tunes each one aggressively.
- MergeTree storage. Data is written as immutable parts, each split into granules of ~8192 rows. Each granule has a sparse primary-key mark, so range scans on a sorted column can skip almost every block (sketched in code after this list).
- PREWHERE. An explicit late-materialisation hint. ClickHouse reads the predicate column, evaluates the filter, then fetches the SELECT columns only for surviving rows. Modern versions move predicates into PREWHERE automatically when selectivity warrants.
- Vectorised execution with CPU dispatch. Hot loops are compiled multiple times (SSE4.2, AVX2, AVX-512) and the binary picks the variant at startup based on cpuid. See the CPU dispatch blog.
- Per-column codecs. Columns can specify their own codec stack (Delta + LZ4 for timestamps, Gorilla for floats), trading CPU for size on the data shape that suits each column.
- Skip indexes. Beyond the sparse primary index, secondary skip indexes (min/max, set, Bloom filter, n-gram) prune blocks for non-primary columns.
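To make the MergeTree bullet concrete, here is a minimal sketch of a sparse primary index: one mark per granule, binary-searched to bound the granules a key-range filter can touch. The granule size matches ClickHouse's default; everything else is simplified:

```python
import bisect

GRANULE = 8192   # ClickHouse's default index_granularity

# A part's primary-key column, sorted on disk.
keys = list(range(0, 50_000_000, 3))      # ~16.7M sorted rows

# Sparse index: one mark (the first key) per granule, not per row.
marks = keys[::GRANULE]                   # ~2000 marks for 16.7M rows

def granule_range(lo, hi):
    """Granules that may contain keys in [lo, hi]."""
    first = max(bisect.bisect_right(marks, lo) - 1, 0)
    last = bisect.bisect_right(marks, hi)       # exclusive
    return first, last

first, last = granule_range(1_000_000, 1_010_000)
print(f"scan granules {first}..{last - 1} of {len(marks)}; "
      f"skip the other {len(marks) - (last - first)}")
```

Two binary searches over ~2000 marks replace a scan of ~16.7M rows; only the two overlapping granules are ever read from disk.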
When are columnar databases slower than row stores? #
The same layout that wins on wide aggregations is what makes single-row work expensive: every column file must be opened to reconstruct one row. Columnar engines lose to row stores on point lookups, per-row mutations, and high-concurrency transactional writes.
For mixed workloads, a hybrid pattern (Postgres for OLTP plus ClickHouse for analytics) is usually the right shape rather than forcing one layout to do both jobs. Cross-engine numbers, including the cases where row stores outperform column stores, live in the row vs column benchmark.