The ORC (Optimized Row Columnar) file format is an open, column-oriented, compressed, self-describing storage format for tabular data. It stores data by column inside horizontal slices called stripes, embeds its own schema and per-column statistics, and was built for large-scale analytics. Because it is columnar and indexed, an engine can read only the columns and the slices a query needs. You can read one in place with clickhouse local: `clickhouse local -q "SELECT * FROM file('events.orc')"`.
| Property | ORC |
|---|---|
| Name | Optimized Row Columnar |
| Layout | Columnar, grouped into stripes |
| Schema | Embedded in the file footer (self-describing) |
| Compression | One codec per file, applied to each column separately: zstd, zlib, snappy, lz4, none |
| Statistics | File, stripe, and row-group (index) min/max + counts |
| Typical use | Analytics, data lakes, Hive/Spark ecosystems |
| Open standard | Apache ORC; readable by most analytics tools |
An ORC file is a container read from the back. From the outside in:
- File: one or more stripes, followed by a file footer and a postscript. The postscript sits at the very end and records the compression codec and the footer's length, so a reader starts at the tail and works backward.
- File footer: the schema, the row count, and the list of stripes with their offsets and statistics. This is what makes the file self-describing.
- Stripe: a horizontal slice of rows (default target ~64 MB). Each stripe is independently readable and holds its own index, the column data, and a stripe footer.
- Row group / index stride: within a stripe, columns are sub-divided every N rows (default 10,000). ORC stores min/max statistics per stride in a built-in index, so it can skip strides that cannot match a predicate without decompressing them.
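The tail-first read order described above can be sketched with a toy container. This is a minimal illustration, not the real ORC byte layout: actual ORC serialises the footer and postscript with Protocol Buffers, whereas this sketch uses JSON, and the function names (`write_toy_file`, `read_toy_file`, `read_stripe`) are hypothetical. The one detail borrowed from ORC is that the final byte of the file holds the postscript length.

```python
import json

def write_toy_file(stripes, schema):
    """Lay out stripes, then a footer, then a postscript, then one length byte."""
    body = b""
    stripe_index = []
    for columns in stripes:
        payload = json.dumps(columns).encode()
        rows = len(next(iter(columns.values())))
        stripe_index.append({"offset": len(body), "length": len(payload), "rows": rows})
        body += payload
    footer = json.dumps({"schema": schema, "stripes": stripe_index}).encode()
    postscript = json.dumps({"codec": "none", "footer_length": len(footer)}).encode()
    assert len(postscript) < 256  # the length must fit in the final byte
    return body + footer + postscript + bytes([len(postscript)])

def read_toy_file(blob):
    """Start at the tail: length byte -> postscript -> footer."""
    ps_len = blob[-1]
    ps_start = len(blob) - 1 - ps_len
    postscript = json.loads(blob[ps_start:-1])
    footer_start = ps_start - postscript["footer_length"]
    footer = json.loads(blob[footer_start:ps_start])
    return postscript, footer

def read_stripe(blob, footer, i):
    """Jump straight to one stripe using the offsets recorded in the footer."""
    s = footer["stripes"][i]
    return json.loads(blob[s["offset"]:s["offset"] + s["length"]])

blob = write_toy_file(
    [{"id": [0, 1], "country": ["GB", "AU"]}, {"id": [2], "country": ["IN"]}],
    {"id": "int", "country": "string"},
)
postscript, footer = read_toy_file(blob)
second_stripe = read_stripe(blob, footer, 1)
```

The reader never scans the stripes to find the schema or the row counts; it learns both from the tail and only then seeks to the stripe it wants.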
That layout is easy to describe. It is more convincing to read it out of a real file, so let's write one and crack it open.
clickhouse local writes and reads ORC. It's part of ClickHouse; install it with clickhousectl, the ClickHouse CLI for local and cloud:
Generate a 1,000,000-row demo file, events.orc:
```bash
clickhouse local -q "
SELECT
    number AS id,
    toDateTime('2026-01-01 00:00:00') + toIntervalMinute(number) AS event_time,
    ['GB','AU','IN','US','DE'][(number % 5) + 1] AS country,
    ['mobile','desktop','tablet'][(number % 3) + 1] AS device,
    ['view','click','purchase'][(number % 3) + 1] AS event_type,
    round(randUniform(1, 500), 2) AS revenue,
    toUInt8((number % 4) + 1) AS quantity
FROM numbers(1000000)
INTO OUTFILE 'events.orc'
FORMAT ORC"
```
ORC is self-describing, so ClickHouse infers the schema straight from the footer. DESCRIBE reads it back without scanning the data:
Types come from the file, not a guess: integers stay integers, the timestamp comes back as DateTime64. ORC columns are nullable by definition, which is why each type is wrapped in Nullable.
Querying is the same SQL you would run against any file. Filter and aggregate over a million rows:
Because ORC is columnar, this query touches only country, event_type and revenue; the other four of the file's seven columns are never read. Scanning, filtering and aggregating all 1,000,000 rows takes 0.026 s (best of three runs, warm, on an Apple M4 Pro with 14 cores and 24 GB RAM).
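To make the column-pruning point concrete, here is a small in-memory sketch. It assumes a hypothetical `ColumnStore` class that records which columns a query asks for. The data mirrors the shape of the demo file, except that revenue is a deterministic stand-in for the random values, and the aggregation (revenue per country for one event type) is an illustrative query, not necessarily the one timed above.

```python
from collections import defaultdict

class ColumnStore:
    """Toy columnar table: each column is stored, and read, on its own."""
    def __init__(self, columns):
        self._columns = columns
        self.columns_read = set()

    def read(self, name):
        self.columns_read.add(name)  # record that this column was touched
        return self._columns[name]

def build_demo(n):
    countries = ["GB", "AU", "IN", "US", "DE"]
    devices = ["mobile", "desktop", "tablet"]
    events = ["view", "click", "purchase"]
    return ColumnStore({
        "id": list(range(n)),
        "event_time": list(range(n)),  # minutes since start, a stand-in for the timestamp
        "country": [countries[i % 5] for i in range(n)],
        "device": [devices[i % 3] for i in range(n)],
        "event_type": [events[i % 3] for i in range(n)],
        "revenue": [float(i % 100) for i in range(n)],  # deterministic stand-in
        "quantity": [(i % 4) + 1 for i in range(n)],
    })

def revenue_by_country(store, event_type):
    # Only three of the seven columns are ever requested.
    country = store.read("country")
    kind = store.read("event_type")
    revenue = store.read("revenue")
    totals = defaultdict(float)
    for c, k, r in zip(country, kind, revenue):
        if k == event_type:
            totals[c] += r
    return dict(totals)

store = build_demo(15)
totals = revenue_by_country(store, "purchase")
untouched = set(store._columns) - store.columns_read
```

After the query, `untouched` holds id, event_time, device and quantity: in a columnar file those are bytes the reader never has to fetch or decompress.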
Compression and encoding
Compression is applied column by column, and because a column holds one type with a similar distribution, it packs down well. ORC supports zstd, zlib, snappy, lz4 and none; the codec is chosen once per file and recorded in the postscript. Underneath the codec, ORC layers lightweight encodings: dictionary encoding for repeated strings, run-length encoding for runs of the same value, and delta encoding for sequential integers. Compressed data is split into chunks no larger than the compression block size, so a reader can decompress one chunk at a time rather than a whole column.
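The three lightweight encodings are easy to see in miniature. The functions below are simplified sketches with hypothetical names; they show the idea behind each encoding, not ORC's actual on-disk byte format, which is more elaborate.

```python
def dictionary_encode(values):
    """Replace repeated values with small integer codes into a dictionary."""
    dictionary, positions, codes = [], {}, []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        codes.append(positions[v])
    return dictionary, codes

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def delta_encode(values):
    """Store the first value plus the differences between neighbours."""
    if not values:
        return None, []
    return values[0], [b - a for a, b in zip(values, values[1:])]

def delta_decode(first, deltas):
    """Rebuild the original sequence from the first value and the deltas."""
    out = [] if first is None else [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

countries = ["GB", "AU", "IN", "US", "DE", "GB", "AU"]
dictionary, codes = dictionary_encode(countries)
runs = run_length_encode([1, 1, 1, 2, 2, 3])
first, deltas = delta_encode([100, 101, 102, 103])
```

A column like country collapses to a five-entry dictionary plus small integer codes, and a sequential column like id becomes a start value and a run of identical deltas, which run-length encoding then shrinks further. The general-purpose codec works on what is left.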
The footer carries the proof. ClickHouse reads ORC data natively, but it has no ORC counterpart to the metadata format it offers for inspecting Parquet footers, so we use a standard ORC reader (pyarrow) to print the footer:
A million rows, ZSTD compressed, in a single stripe (each stripe targets ~64 MB, and this dataset fits in one). The 10,000-row index stride splits that stripe into 100 indexed row groups, each carrying its own min/max statistics. The footer is 332 bytes and the postscript 25 bytes: the entire map of the file is a tiny tail you read first, then you jump straight to the column data you want.
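The stride arithmetic, and the skipping it enables, can be sketched in a few lines. This is a simplified model with hypothetical function names: it keeps only a min and a max per stride for a single integer column, here a stand-in for the id column, and applies a range predicate to that index.

```python
def build_stride_index(values, stride):
    """Record (first_row, min, max) for every `stride` rows of a column."""
    index = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        index.append((start, min(chunk), max(chunk)))
    return index

def strides_to_read(index, low, high):
    """Keep only strides whose [min, max] range overlaps [low, high]."""
    return [start for start, lo, hi in index if hi >= low and lo <= high]

ids = list(range(1_000_000))              # stand-in for the id column
index = build_stride_index(ids, 10_000)   # 1,000,000 / 10,000 = 100 strides
hits = strides_to_read(index, 250_000, 259_999)
```

With these inputs `index` has 100 entries and `hits` contains a single stride, the one starting at row 250,000. A reader that trusts the index can leave the other 99 strides compressed. How much a real query benefits depends on how well the data is clustered on the filtered column: an unsorted column has wide min/max ranges in every stride, so little can be skipped.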
ORC and Parquet are close cousins: both are open, columnar, compressed and self-describing, and a query engine treats them much the same. The differences are organisational. ORC groups rows into stripes with a built-in row-index every 10,000 rows; Parquet groups rows into row groups and pages. ORC grew up in the Hive and Spark world, Parquet in the broader data-lake ecosystem, so which one you meet usually depends on the tool that wrote it. For analytics on your laptop the choice rarely matters; both read fast. If you need to move between them, see convert ORC to Parquet and convert Parquet to ORC.
You don't need a server or an import step to work with ORC. clickhouse local reads an ORC file in place from the command line; it's a good single-binary tool for poking at files on your laptop. ClickHouse uses the same SQL whether you point it at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local ORC file scales up unchanged.
For the full command-line walkthrough, covering globbing many files, projecting columns and casting types, see how to read an ORC file. Prefer Python? See read an ORC file in Python. To turn a CSV into ORC, see convert CSV to ORC.
Run it yourself: the data generator and every command above live in local-analytics/orc-file-format in the ClickHouse examples repo.