ORC file format

The ORC (Optimized Row Columnar) file format is an open, column-oriented, compressed, self-describing storage format for tabular data. It stores data by column inside horizontal slices called stripes, embeds its own schema and per-column statistics, and was built for large-scale analytics. Because it is columnar and indexed, an engine can read only the columns and the slices a query needs. You can read one in place with clickhouse local: clickhouse local -q "SELECT * FROM file('events.orc')".

At a glance

Property	ORC
Name	Optimized Row Columnar
Layout	Columnar, grouped into stripes
Schema	Embedded in the file footer (self-describing)
Compression	One codec per file (zstd, zlib, snappy, lz4 or none), applied to each column separately
Statistics	File, stripe, and row-group (index) min/max + counts
Typical use	Analytics, data lakes, Hive/Spark ecosystems
Open standard	Apache ORC; readable by most analytics tools

Internal structure

An ORC file is a container read from the back. From the outside in:

File: one or more stripes, followed by a file footer and a postscript. The postscript sits at the very end and records the compression codec and the footer's length, so a reader starts at the tail and works backward.
File footer: the schema, the row count, and the list of stripes with their offsets and statistics. This is what makes the file self-describing.
Stripe: a horizontal slice of rows (default target ~64 MB). Each stripe is independently readable and holds its own index, the column data, and a stripe footer.
Row group / index stride: within a stripe, columns are sub-divided every N rows (default 10,000). ORC stores min/max statistics per stride in a built-in index, so it can skip strides that cannot match a predicate without decompressing them.
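To make the stride-skipping idea concrete, here is a minimal Python sketch of a min/max index over one column. It is a toy model under stated assumptions, not ORC's on-disk index format: StrideStats, build_index and strides_to_read are hypothetical names, and the column is assumed to be a sorted integer id.

```python
from dataclasses import dataclass


@dataclass
class StrideStats:
    """Min/max for one index stride of a single column (toy model)."""
    first_row: int
    minimum: int
    maximum: int


def build_index(values, stride=10_000):
    """Compute per-stride min/max, roughly what a writer records per stride."""
    index = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        index.append(StrideStats(start, min(chunk), max(chunk)))
    return index


def strides_to_read(index, lo, hi):
    """Keep only strides whose [min, max] range overlaps lo <= x <= hi."""
    return [s for s in index if s.maximum >= lo and s.minimum <= hi]


# Hypothetical sorted id column with one million rows.
ids = list(range(1_000_000))
index = build_index(ids)
hits = strides_to_read(index, 123_456, 125_000)
print(len(index), "strides,", len(hits), "need decompressing")
```

On sorted data like this, a narrow range predicate overlaps a single stride. On unsorted data the per-stride ranges overlap far more, so fewer strides can be skipped.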

That layout is easy to describe. It is more convincing to read it out of a real file, so let's write one and crack it open.

clickhouse local writes and reads ORC. It's part of ClickHouse; install it with clickhousectl, the ClickHouse CLI for local and cloud:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Generate a 1,000,000-row demo file, events.orc:

clickhouse local -q "
SELECT
    number AS id,
    toDateTime('2026-01-01 00:00:00') + toIntervalMinute(number) AS event_time,
    ['GB','AU','IN','US','DE'][(number % 5) + 1] AS country,
    ['mobile','desktop','tablet'][(number % 3) + 1] AS device,
    ['view','click','purchase'][(number % 3) + 1] AS event_type,
    round(randUniform(1, 500), 2) AS revenue,
    toUInt8((number % 4) + 1) AS quantity
FROM numbers(1000000)
INTO OUTFILE 'events.orc'
FORMAT ORC"

ORC is self-describing, so ClickHouse infers the schema straight from the footer. DESCRIBE reads it back without scanning the data:

clickhouse local -q "DESCRIBE file('events.orc') FORMAT Pretty"

   ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
   ┃ name       ┃ type                    ┃ default_type ┃ default_expression ┃ comment ┃ codec_expression ┃ ttl_expression ┃
   ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
1. │ id         │ Nullable(Int64)         │              │                    │         │                  │                │
2. │ event_time │ Nullable(DateTime64(9)) │              │                    │         │                  │                │
3. │ country    │ Nullable(String)        │              │                    │         │                  │                │
4. │ device     │ Nullable(String)        │              │                    │         │                  │                │
5. │ event_type │ Nullable(String)        │              │                    │         │                  │                │
6. │ revenue    │ Nullable(Float64)       │              │                    │         │                  │                │
7. │ quantity   │ Nullable(Int8)          │              │                    │         │                  │                │
   └────────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

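Types come from the file, not a guess: integers stay integers and the timestamp comes back as DateTime64. ORC has no unsigned integer types, so the UInt64 id and UInt8 quantity we wrote come back as Int64 and Int8. ORC columns are nullable by definition, which is why each type is wrapped in Nullable.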

Reading the data

Querying is the same SQL you would run against any file. Filter and aggregate over a million rows:

clickhouse local -q "
SELECT country, count() AS purchases, round(sum(revenue)) AS revenue
FROM file('events.orc')
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC
FORMAT Pretty"

   ┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
   ┃ country ┃ purchases ┃  revenue ┃
   ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩
1. │ US      │     66667 │ 16722791 │
2. │ DE      │     66666 │ 16704759 │
3. │ IN      │     66667 │ 16692131 │
4. │ AU      │     66666 │ 16687078 │
5. │ GB      │     66667 │ 16651774 │
   └─────────┴───────────┴──────────┘

Because ORC is columnar, this query touches only three of the seven columns (country, event_type and revenue); the other four are never read. Scanning, filtering and aggregating all 1,000,000 rows takes 0.026s (best of three, warm, on an Apple M4 Pro, 14 cores, 24 GB RAM).
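As a rough illustration of why only three columns are touched, here is a toy columnar table in Python that logs which columns a query reads. ColumnStore is a hypothetical class for this sketch, the data is a 30-row stand-in rather than the demo file, and the revenue values are placeholders, not the random values written above.

```python
from collections import defaultdict


class ColumnStore:
    """Toy columnar table: one list per column, plus a log of columns read."""

    def __init__(self, columns):
        self._columns = columns
        self.columns_read = set()

    def read(self, name):
        self.columns_read.add(name)
        return self._columns[name]


n = 30
table = ColumnStore({
    "id": list(range(n)),
    "country": [["GB", "AU", "IN", "US", "DE"][i % 5] for i in range(n)],
    "device": [["mobile", "desktop", "tablet"][i % 3] for i in range(n)],
    "event_type": [["view", "click", "purchase"][i % 3] for i in range(n)],
    "revenue": [float(i) for i in range(n)],  # placeholder values
})

# Equivalent of: SELECT country, count(), sum(revenue)
#                WHERE event_type = 'purchase' GROUP BY country
purchases = defaultdict(lambda: [0, 0.0])
for country, event_type, revenue in zip(
    table.read("country"), table.read("event_type"), table.read("revenue")
):
    if event_type == "purchase":
        purchases[country][0] += 1
        purchases[country][1] += revenue

print(sorted(table.columns_read))
```

The id and device lists are never touched. In a row-oriented layout every field of every row would have to be read to answer the same question.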

Compression and encoding

Compression is applied per column, and because a column holds one type with a similar distribution, it packs down well. ORC supports zstd, zlib, snappy, lz4 and none, and layers lightweight encodings underneath: dictionary encoding for repeated strings, run-length encoding for runs of the same value, and delta encoding for sequential integers. Compression is applied in fixed blocks (the compression block size), so a reader can decompress one block at a time rather than a whole column.
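The three lightweight encodings are easier to picture with a few lines of Python. This is a deliberately simplified sketch with hypothetical function names; ORC's real encoders (for example its integer run-length encoding) are more elaborate and pick a strategy per run, so treat this as the idea rather than the format.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each string with an index into a list of distinct values."""
    dictionary = sorted(set(values))
    position = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [position[v] for v in values]


def run_length_encode(values):
    """Collapse runs of a repeated value into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def delta_encode(values):
    """Store the first value, then the difference between neighbours.

    Assumes a non-empty list.
    """
    return values[0], [b - a for a, b in zip(values, values[1:])]


countries = ["GB", "AU", "IN", "US", "DE"] * 2
print(dictionary_encode(countries))

print(run_length_encode([1, 1, 1, 1, 2, 2, 3]))

# Sequential integers: every delta is 1, which then run-length encodes to one pair.
base, deltas = delta_encode(list(range(100, 110)))
print(base, run_length_encode(deltas))
```

Low-cardinality strings shrink to small integers, and a sequential integer column collapses to a start value plus a single run. The general-purpose codec then works on this already compact output.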

The footer carries the proof. ClickHouse reads ORC data natively but has no ORC-metadata FORMAT the way it exposes Parquet footers, so we use a standard ORC reader (pyarrow) to print the footer:

import pyarrow.orc as orc

# ORCFile exposes the file-level metadata as attributes.
f = orc.ORCFile("events.orc")
print(f"rows:             {f.nrows}")
print(f"stripes:          {f.nstripes}")
print(f"row_index_stride: {f.row_index_stride}")
# Floor division is exact only when the rows divide evenly into strides.
print(f"row_index_groups: {f.nrows // f.row_index_stride}")
print(f"compression:      {f.compression}")
print(f"compression_block:{f.compression_size} bytes")
print(f"writer:           {f.writer}")
print(f"file_version:     {f.file_version}")
print(f"content_length:   {f.content_length} bytes (stripe data)")
print(f"footer_length:    {f.file_footer_length} bytes")
print(f"postscript_length:{f.file_postscript_length} bytes")
print(f"file_length:      {f.file_length} bytes total")

rows:             1000000
stripes:          1
row_index_stride: 10000
row_index_groups: 100
compression:      ZSTD
compression_block:262144 bytes
writer:           ORC_CPP
file_version:     0.12
content_length:   3281780 bytes (stripe data)
footer_length:    332 bytes
postscript_length:25 bytes
file_length:      3282343 bytes total

A million rows, ZSTD compressed, in a single stripe (each stripe targets ~64 MB, and this dataset fits in one). The 10,000-row index stride splits that stripe into 100 indexed row groups, each carrying its own min/max statistics. The footer is 332 bytes and the postscript 25 bytes: the entire map of the file is a tiny tail you read first, then you jump straight to the column data you want.
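The "read the tail first" step can be sketched with nothing but the Python standard library. This assumes the tail layout as we understand it from the ORC specification: the last byte of the file holds the postscript length, and the postscript itself is an uncompressed Protocol Buffers message. The field numbers and codec numbering below are our reading of that spec and should be checked against it. read_tail and parse_postscript are illustrative names, not part of any library.

```python
import os

# Assumed CompressionKind numbering from the ORC protobuf definition.
COMPRESSION_KINDS = {0: "NONE", 1: "ZLIB", 2: "SNAPPY", 3: "LZO", 4: "LZ4", 5: "ZSTD"}


def read_varint(buf, pos):
    """Decode one protobuf base-128 varint starting at buf[pos]."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_postscript(ps):
    """Pull a few fields out of the serialized postscript message."""
    fields, pos = {}, 0
    while pos < len(ps):
        tag, pos = read_varint(ps, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:        # varint
            fields[number], pos = read_varint(ps, pos)
        elif wire_type == 2:      # length-delimited bytes
            length, pos = read_varint(ps, pos)
            fields[number] = ps[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type}")
    return {
        "footer_length": fields.get(1),        # assumed field 1
        "compression": COMPRESSION_KINDS.get(fields.get(2), "UNKNOWN"),
        "compression_block": fields.get(3),    # assumed field 3
        "metadata_length": fields.get(5),      # assumed field 5
        "magic": fields.get(8000),             # assumed field 8000
    }


def read_tail(path):
    """Read the last byte (postscript length), then the postscript before it."""
    with open(path, "rb") as f:
        f.seek(-1, os.SEEK_END)
        ps_length = f.read(1)[0]
        f.seek(-(1 + ps_length), os.SEEK_END)
        return ps_length, parse_postscript(f.read(ps_length))


if os.path.exists("events.orc"):  # only runs if the demo file was generated
    print(read_tail("events.orc"))
```

Two small seeks are enough to learn the codec and where the footer starts. A reader would then decompress the footer to get the schema and stripe list before touching any column data.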

How it compares to Parquet

ORC and Parquet are close cousins: both are open, columnar, compressed and self-describing, and a query engine treats them much the same. The differences are organisational. ORC groups rows into stripes with a built-in row-index every 10,000 rows; Parquet groups rows into row groups and pages. ORC grew up in the Hive and Spark world, Parquet in the broader data-lake ecosystem, so which one you meet usually depends on the tool that wrote it. For analytics on your laptop the choice rarely matters; both read fast. If you need to move between them, see convert ORC to Parquet and convert Parquet to ORC.

Read one yourself

You don't need a server or an import step to work with ORC. clickhouse local reads an ORC file in place from the command line; it's a good single-binary tool for poking at files on your laptop. ClickHouse uses the same SQL whether you point it at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local ORC file scales up unchanged.

For the full command-line walkthrough, covering globbing many files, projecting columns and casting types, see how to read an ORC file. Prefer Python? See read an ORC file in Python. To turn a CSV into ORC, see convert CSV to ORC.

Run it yourself: the data generator and every command above live in local-analytics/orc-file-format in the ClickHouse examples repo.
