Columnar storage formats define how column-oriented data is laid out, separately from the engines that read it. Apache Parquet (released July 2013) is the cloud-lake default. Apache ORC (released with Hive 0.12 in October 2013) is Hive-native with deeper per-stripe statistics. Apache Arrow (released February 2016) is an in-memory layout for zero-copy interchange. They are file or memory layouts, not databases.
TL;DR #
- Apache Parquet (July 2013) and Apache ORC (October 2013) are on-disk columnar file formats. Apache Arrow (February 2016) is an in-memory layout for cross-process exchange.
- Parquet groups data into row groups of 64–512 MB, each holding one column chunk per column, with per-row-group statistics in the file footer for predicate pushdown.
- ORC stores data in stripes typically around 200 MB, with statistics at file, stripe, and row-group-within-stripe levels (three layers of skip metadata).
- Arrow defines RecordBatches with columnar buffers laid out for SIMD. Spark, DuckDB, Polars, and pandas use Arrow at the wire level, but rarely as long-term storage.
- Native engine formats (ClickHouse MergeTree parts, Snowflake micro-partitions, DuckDB's native database file) typically read fewer bytes than Parquet on the same data because the engine controls sort order, sparse indexes, and skip indexes.
- Iceberg, Delta Lake, and Apache Hudi are table formats on top of Parquet (or ORC/Avro). They add transactions and schema evolution. The data is still Parquet on disk.
What are columnar storage formats? #
Columnar storage formats specify how column-oriented data is organised in files or memory buffers. Each column's values are stored contiguously, with metadata describing types, compression, and statistics. The format is read by a query engine. Parquet, ORC, and Arrow are formats; ClickHouse, DuckDB, and Snowflake are engines that consume them.
The distinction matters because formats and engines evolve on separate cadences. A Parquet file can be read by Spark, Trino, Snowflake, DuckDB, Athena, BigQuery, and ClickHouse. The engine adds query planning, indexes beyond per-block statistics, and a query runtime. The format defines only the bytes on disk and how to decode them. The same separation is described in the Apache Arrow columnar format spec, which positions Arrow as the in-memory counterpart to on-disk formats like Parquet.
Two layout decisions define every columnar format: how to chunk a table horizontally so a reader can parallelise and skip work, and how to encode each column's values for size and decoding speed. Parquet, ORC, and Arrow make different choices on both axes, which is why they coexist instead of converging. For a primer on the underlying layout, see what is columnar storage. For the broader category, see the columnar database overview.
What is Apache Parquet? #
Apache Parquet is an on-disk columnar file format released in July 2013 by Twitter and Cloudera, inspired by Google's Dremel paper. A Parquet file is split into row groups (typically 64–512 MB); each row group holds one column chunk per column, and each column chunk is split into pages. Per-row-group statistics (min, max, null counts) sit in the file footer and drive predicate pushdown.
Parquet 2.x added page indexes, Bloom filters, and column indexes that store per-page statistics, narrowing what was previously ORC's clearest advantage. The default compression codec is Snappy, with ZSTD as the common alternative for archival workloads, and LZ4, GZIP, and Brotli also supported. Encodings include dictionary, RLE/bit-packing, and delta. The format is encoding-rich enough that compression ratios commonly land in the 5–10× band, with column-specific cases beyond 30×.
Parquet's reach is the reason it became the cloud-lake default. Spark, Trino, Presto, Hive, ClickHouse, DuckDB, Polars, BigQuery, Snowflake, Athena, and Redshift all read Parquet directly. ClickHouse's Parquet integration reads and writes partitioned Parquet datasets directly from S3-compatible object storage.
What is Apache ORC? #
Apache ORC (Optimized Row Columnar) is an on-disk format released with Apache Hive 0.12 in October 2013, designed to replace Hive's earlier RCFile. An ORC file is split into stripes, each stripe contains the row data plus a stripe footer with column-level statistics, and the file footer holds aggregated statistics across stripes. The ORC v1 specification notes that stripes are "typically ~200MB". The Hive default orc.stripe.size is 64 MB, but production configurations commonly tune this higher.
ORC stores statistics at three levels: file, stripe, and row group within a stripe (the spec defaults to 10,000 rows per row-group index entry). That depth is the basis of its predicate-pushdown advantage. Default compression is Zlib, with Snappy and ZSTD as alternatives. The format supports ACID transactions via delta files: updates land in delta files alongside the base files, and a compactor merges them periodically. This is what makes ORC the natural format for Hive ACID tables.
ORC's ecosystem is narrower than Parquet's. Inside the Hive/HDFS world it is the dominant format. Outside, support is thinner. Spark, Trino, and Presto read ORC well, while ClickHouse, DuckDB, BigQuery, and Snowflake have varying levels of support. The common pattern is Hive-on-ORC for legacy batch workloads, Parquet everywhere else.
What is Apache Arrow? #
Apache Arrow is an in-memory columnar format released in February 2016, designed for zero-copy data exchange between processes. The Arrow columnar format spec defines RecordBatches as ordered collections of arrays. Each field is a contiguous values buffer plus a validity bitmap, with no per-row overhead and a layout aligned for SIMD instructions.
The critical distinction from Parquet and ORC is that Arrow is in-memory, not on-disk. The Arrow IPC format and Feather V2 do exist as on-disk wrappers, but those are uncommon for production storage. Arrow's primary job is to live in RAM and move between processes. When DuckDB hands a result back to pandas, when Spark passes data to a Python UDF, when Polars exchanges with a JVM client, the layout is Arrow. Wes McKinney's response to Daniel Abadi's columnar comparison frames Arrow as complementary to Parquet rather than competing. The on-disk-vs-in-memory split is the design.
Arrow's reach is at the wire level: Spark, DuckDB, Polars, pandas (via PyArrow), Velox, and Apache DataFusion all use Arrow buffers for inter-process or client-server transfer. Arrow complements Parquet rather than competing with it.
How do row groups, stripes, and record batches compare? #
The three formats partition a table horizontally before splitting it into columns, but they name and size those partitions differently. The shape of the partition controls parallelism, predicate-pushdown granularity, and metadata overhead.
| Format | Unit | Typical size | Storage | Default codec | Statistics depth |
|---|---|---|---|---|---|
| Apache Parquet | Row group → column chunk → page | 64–512 MB row group | On-disk | Snappy | File + row group + page (v2.x) |
| Apache ORC | Stripe → column → row group within stripe | ~200 MB stripe | On-disk | Zlib | File + stripe + row group |
| Apache Arrow | RecordBatch → array | KB–MB per batch | In-memory (IPC on-disk) | None (LZ4/ZSTD optional in IPC) | None at format level |
Row group and stripe size is a tuning knob. Too large hurts predicate pushdown and parallelism. Too small hurts compression and explodes metadata overhead. The 64–512 MB band is the production sweet spot for Parquet, with smaller values for latency-sensitive lakes and larger values for archival workloads. Arrow batches are smaller, typically 1024–65,536 rows, because the goal is fast in-process iteration, not skip-on-disk.
When do native engine formats beat open formats? #
Native engine formats (ClickHouse MergeTree parts, Snowflake micro-partitions, DuckDB's native database file) typically outperform Parquet on the same data when the engine owns the storage. They use the same columnar layout, but the engine controls sort order, sparse indexes, skip indexes, and per-column codecs that an interchange format either does not expose or stores less efficiently.
ClickHouse, for example, layers compression codecs on top of dictionary or delta encodings per column, maintains a sparse primary index keyed on the table's sort key, and stores data skipping indexes (min-max, set, Bloom) that Parquet's footer cannot replicate. On a selective filter against an identical dataset, MergeTree often scans substantially fewer bytes than Parquet because the sort order and skip indexes prune granules before the codec runs.
The native-vs-open choice depends on portability. Open formats win when the same data must be read by multiple engines, or stored in a vendor-neutral lake. Native formats win when one engine owns the workload and can exploit sort order, indexes, and merge policy in ways a portable spec cannot.