Columnar storage formats define how column-oriented data is laid out, separately from the engines that read it. Apache Parquet (released July 2013) is the cloud-lake default. Apache ORC (released with Hive 0.12 in October 2013) is Hive-native with deeper per-stripe statistics. Apache Arrow (released February 2016) is an in-memory layout for zero-copy interchange. They are file or memory layouts, not databases.
TL;DR #
- Apache Parquet (July 2013) and Apache ORC (October 2013) are on-disk columnar file formats. Apache Arrow (February 2016) is an in-memory layout for cross-process exchange.
- Parquet groups data into row groups of 64–512 MB, each holding one column chunk per column, with per-row-group statistics in the file footer for predicate pushdown.
- ORC stores data in stripes typically around 200 MB, with statistics at file, stripe, and row-group-within-stripe levels (three layers of skip metadata).
- Arrow defines RecordBatches with columnar buffers laid out for SIMD. Spark, DuckDB, Polars, and pandas use Arrow at the wire level, but rarely as long-term storage.
- Native engine formats (ClickHouse MergeTree parts, Snowflake micro-partitions, DuckDB's native database file) typically read fewer bytes than Parquet on the same data because the engine controls sort order, sparse indexes, and skip indexes.
- Iceberg, Delta Lake, and Apache Hudi are table formats on top of Parquet (or ORC/Avro). They add transactions and schema evolution. The data is still Parquet on disk.
What are columnar storage formats? #
Columnar storage formats specify how column-oriented data is organised in files or memory buffers. Each column's values are stored contiguously, with metadata describing types, compression, and statistics. The format is read by a query engine. Parquet, ORC, and Arrow are formats; ClickHouse, DuckDB, and Snowflake are engines that consume them.
The distinction matters because formats and engines evolve on separate cadences. A Parquet file can be read by Spark, Trino, Snowflake, DuckDB, Athena, BigQuery, and ClickHouse. The engine adds query planning, indexes beyond per-block statistics, and a query runtime. The format defines only the bytes on disk and how to decode them. The same separation is described in the Apache Arrow columnar format spec, which positions Arrow as the in-memory counterpart to on-disk formats like Parquet.
Two layout decisions define every columnar format: how to chunk a table horizontally so a reader can parallelise and skip work, and how to encode each column's values for size and decoding speed. Parquet, ORC, and Arrow make different choices on both axes, which is why they coexist instead of converging. For a primer on the underlying layout, see what is columnar storage. For the broader category, see the columnar database overview.
What is Apache Parquet? #
Apache Parquet is an on-disk columnar file format released in July 2013 by Twitter and Cloudera, inspired by Google's Dremel paper. A Parquet file is split into row groups (typically 64–512 MB); each row group holds one column chunk per column, and each column chunk is split into pages. Per-row-group statistics (min, max, null counts) sit in the file footer and drive predicate pushdown.
Parquet 2.x added page indexes, Bloom filters, and column indexes that store per-page statistics, narrowing what was previously ORC's clearest advantage. The default compression codec is Snappy, with ZSTD as the common alternative for archival workloads, and LZ4, GZIP, and Brotli also supported. Encodings include dictionary, RLE/bit-packing, and delta. The format is encoding-rich enough that compression ratios commonly land in the 5–10× band, with column-specific cases beyond 30×.
Parquet's reach is the reason it became the cloud-lake default. Spark, Trino, Presto, Hive, ClickHouse, DuckDB, Polars, BigQuery, Snowflake, Athena, and Redshift all read Parquet directly. ClickHouse's Parquet integration reads and writes partitioned Parquet datasets directly from S3-compatible object storage.
What is Apache ORC? #
Apache ORC (Optimized Row Columnar) is an on-disk format released with Apache Hive 0.12 in October 2013, designed to replace Hive's earlier RCFile. An ORC file is split into stripes, each stripe contains the row data plus a stripe footer with column-level statistics, and the file footer holds aggregated statistics across stripes. The ORC v1 specification notes that stripes are "typically ~200MB". The Hive default orc.stripe.size is 64 MB, but production configurations commonly tune this higher.
ORC stores statistics at three levels: file, stripe, and row group within a stripe (the spec defaults to 10,000 rows per row-group index entry). That depth is the basis of its predicate-pushdown advantage. Default compression is Zlib, with Snappy and ZSTD as alternatives. The format supports ACID transactions via delta files: updates land in delta files alongside the base files, and a compactor merges them periodically. This is what makes ORC the natural format for Hive ACID tables.
ORC's ecosystem is narrower than Parquet's. Inside the Hive/HDFS world it is the dominant format. Outside, support is thinner. Spark, Trino, and Presto read ORC well, while ClickHouse, DuckDB, BigQuery, and Snowflake have varying levels of support. The common pattern is Hive-on-ORC for legacy batch workloads, Parquet everywhere else.
What is Apache Arrow? #
Apache Arrow is an in-memory columnar format released in February 2016, designed for zero-copy data exchange between processes. The Arrow columnar format spec defines RecordBatches as ordered collections of arrays. Each field is a contiguous values buffer plus a validity bitmap, with no per-row overhead and a layout aligned for SIMD instructions.
The critical distinction from Parquet and ORC is that Arrow is in-memory, not on-disk. The Arrow IPC format and Feather V2 do exist as on-disk wrappers, but those are uncommon for production storage. Arrow's primary job is to live in RAM and move between processes. When DuckDB hands a result back to pandas, when Spark passes data to a Python UDF, when Polars exchanges with a JVM client, the layout is Arrow. Wes McKinney's response to Daniel Abadi's columnar comparison frames Arrow as complementary to Parquet rather than competing. The on-disk-vs-in-memory split is the design.
Arrow's reach is at the wire level: Spark, DuckDB, Polars, pandas (via PyArrow), Velox, and Apache DataFusion all use Arrow buffers for inter-process or client-server transfer. Arrow complements Parquet rather than competing with it.
How do row groups, stripes, and record batches compare? #
The three formats partition a table horizontally before splitting it into columns, but they name and size those partitions differently. The shape of the partition controls parallelism, predicate-pushdown granularity, and metadata overhead.
| Format | Unit | Typical size | Storage | Default codec | Statistics depth |
|---|---|---|---|---|---|
| Apache Parquet | Row group → column chunk → page | 64–512 MB row group | On-disk | Snappy | File + row group + page (v2.x) |
| Apache ORC | Stripe → column → row group within stripe | ~200 MB stripe | On-disk | Zlib | File + stripe + row group |
| Apache Arrow | RecordBatch → array | KB–MB per batch | In-memory (IPC on-disk) | None (LZ4/ZSTD optional in IPC) | None at format level |
Row group and stripe size is a tuning knob. Too large hurts predicate pushdown and parallelism. Too small hurts compression and explodes metadata overhead. The 64–512 MB band is the production sweet spot for Parquet, with smaller values for latency-sensitive lakes and larger values for archival workloads. Arrow batches are smaller, typically 1024–65,536 rows, because the goal is fast in-process iteration, not skip-on-disk.
When do native engine formats beat open formats? #
Native engine formats (ClickHouse MergeTree parts, Snowflake micro-partitions, DuckDB's native database file) typically outperform Parquet on the same data when the engine owns the storage. They use the same columnar layout, but the engine controls sort order, sparse indexes, skip indexes, and per-column codecs that an interchange format either does not expose or stores less efficiently.
ClickHouse, for example, layers compression codecs on top of dictionary or delta encodings per column, maintains a sparse primary index keyed on the table's sort key, and stores data skipping indexes (min-max, set, Bloom) that Parquet's footer cannot replicate. On a selective filter against an identical dataset, MergeTree often scans substantially fewer bytes than Parquet because the sort order and skip indexes prune granules before the codec runs.
The native-vs-open choice depends on portability. Open formats win when the same data must be read by multiple engines, or stored in a vendor-neutral lake. Native formats win when one engine owns the workload and can exploit sort order, indexes, and merge policy in ways a portable spec cannot.