How to read an ORC file from the command line

To read an ORC file from the command line, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the file with the file() function:

1clickhouse local -q "SELECT * FROM file('events.orc') LIMIT 10"

1┌────────────────────event_time─┬─user_id─┬─country─┬─device──┬─event_type─┬─revenue─┬─quantity─┐
21. │ 2026-01-01 00:00:00.000000000 │       1 │ GB      │ desktop │ click      │       5 │        1 │
32. │ 2026-01-01 00:02:17.000000000 │       2 │ US      │ mobile  │ view       │    6.01 │        2 │
43. │ 2026-01-01 00:04:34.000000000 │       3 │ DE      │ tablet  │ purchase   │    7.02 │        3 │
54. │ 2026-01-01 00:06:51.000000000 │       4 │ FR      │ desktop │ refund     │    8.03 │        4 │
65. │ 2026-01-01 00:09:08.000000000 │       5 │ IN      │ mobile  │ click      │    9.04 │        5 │
7   └───────────────────────────────┴─────────┴─────────┴─────────┴────────────┴─────────┴──────────┘

ORC carries its own schema in a footer, so ClickHouse reads the column names and types straight from the file with no import step and no CREATE TABLE.

See the schema without declaring one

ORC stores its schema in a footer at the end of the file. ClickHouse reads it, so you never write CREATE TABLE. DESCRIBE prints the column names and the types that came back:

1clickhouse local -q "DESCRIBE file('events.orc') FORMAT PrettyCompact"

1┌─name───────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
21. │ event_time │ Nullable(DateTime64(9)) │              │                    │         │                  │                │
32. │ user_id    │ Nullable(Int64)         │              │                    │         │                  │                │
43. │ country    │ Nullable(String)        │              │                    │         │                  │                │
54. │ device     │ Nullable(String)        │              │                    │         │                  │                │
65. │ event_type │ Nullable(String)        │              │                    │         │                  │                │
76. │ revenue    │ Nullable(Float64)       │              │                    │         │                  │                │
87. │ quantity   │ Nullable(Int8)          │              │                    │         │                  │                │
9   └────────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

Timestamps come back as DateTime64, integers keep their width, and the columns are Nullable because ORC tracks nullability per column. The empty default_type, codec_expression and ttl_expression columns are CREATE TABLE metadata that an ORC file doesn't carry, so they stay blank.

Filter, aggregate, and group by

A viewer shows you rows. A query engine answers questions. Because the file is just a SQL source, the full ClickHouse dialect works on it — WHERE, GROUP BY, aggregates, window functions, joins — with no load step in between:

1clickhouse local -q "
2SELECT country,
3       count() AS events,
4       round(sum(revenue), 2) AS revenue,
5       round(avg(quantity), 3) AS avg_qty
6FROM file('events.orc')
7WHERE event_type = 'purchase'
8GROUP BY country
9ORDER BY revenue DESC
10FORMAT PrettyCompact"

1┌─country─┬─events─┬─revenue─┬─avg_qty─┐
21. │ GB      │      2 │   34.24 │       3 │
32. │ DE      │      2 │   26.16 │       4 │
43. │ IN      │      1 │    15.1 │       1 │
5   └─────────┴────────┴─────────┴─────────┘

Columnar by stripe: only the columns you ask for get read

ORC lays data out by column, not by row, and groups rows into stripes (typically tens of thousands of rows each). Each stripe holds the column data plus an index with per-column statistics like min and max. Two things follow from that layout, and they're the reason ORC is a query format rather than a transport format:

Column pruning. A query that touches two columns reads only those two off disk. The other five are never decoded. The aggregation below scans 3,000,000 rows but only ever looks at country and revenue:

1clickhouse local -q "
2SELECT country, round(sum(revenue), 2) AS rev
3FROM file('events_large.orc')
4GROUP BY country
5ORDER BY rev DESC
6LIMIT 5
7FORMAT PrettyCompact"

1┌─country─┬─────────rev─┐
21. │ JP      │ 75125313.31 │
32. │ AU      │ 75114924.84 │
43. │ CA      │ 75095320.01 │
54. │ US      │ 75066382.72 │
65. │ DE      │ 74978831.54 │
7   └─────────┴─────────────┘

Stripe skipping. When a WHERE clause filters on a column, ClickHouse can use the per-stripe min/max statistics to skip whole stripes whose range can't match, without decoding them.

This is the same columnar model as Parquet: ORC's stripes play the role Parquet's row groups do, and both keep per-chunk statistics for skipping. The formats are close cousins, and clickhouse-local reads both with the identical file() call. ORC tends to show up in Hive and Spark-era data lakes; Parquet is the more common interchange format today. If you have a choice and the data is moving between tools, Parquet is usually the safer default, but ORC reads just as directly here.

Override the inferred structure when you need to

Inference is right almost always, because the footer is authoritative. When you want to read just a subset of columns, or pin a type, pass the ORC format and an explicit structure as the second and third arguments to file():

1clickhouse local -q "
2SELECT country, count() AS events
3FROM file('events.orc', 'ORC', 'country String, revenue Float64')
4GROUP BY country ORDER BY country
5FORMAT PrettyCompact"

1┌─country─┬─events─┐
21. │ AU      │      3 │
32. │ DE      │      3 │
43. │ FR      │      3 │
54. │ GB      │      4 │
65. │ IN      │      3 │
76. │ US      │      4 │
8   └─────────┴────────┘

Naming only the columns you list keeps the read narrow even if the file has dozens of columns.

Compressed ORC reads transparently

ORC compresses its stripes internally (Zlib, Zstd, LZ4, Snappy), and ClickHouse reads the codec from the footer, so there's nothing to pass. An ORC file wrapped in an outer .orc.gz is also decompressed on the fly, with the extension telling ClickHouse what to do:

1clickhouse local -q "SELECT count() FROM file('events.orc.gz') FORMAT PrettyCompact"

1┌─count()─┐
21. │      20 │
3   └─────────┘

How fast is it on a real file?

Small files are instant in anything; the difference shows at scale. On a 3,000,000-row, ~28 MB ORC file (events_large.orc, generated by the example folder below), the filter-and-group-by query runs in:

1clickhouse local --time -q "
2SELECT country, count(), round(sum(revenue), 2), round(avg(quantity), 3)
3FROM file('events_large.orc')
4WHERE event_type = 'purchase'
5GROUP BY country ORDER BY 3 DESC
6FORMAT Null"

10.28

~0.28 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). Column pruning means the query reads only the columns it touches, and it runs across all cores.

The same SQL scales unchanged

The query you just ran on a laptop file is the same SQL you'd run on a ClickHouse server, or in ClickHouse Cloud. Nothing about SELECT ... WHERE ... GROUP BY changes. Swap file('events.orc') for a table name, or for an s3() function pointing at a bucket of ORC files, and the rest stays put. You prototype against a file on your machine and ship the identical logic to production.

Run it yourself

The complete, runnable example lives here. It has generate.sh (builds the demo file and the 3M-row perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-orc

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-orc
3./generate.sh && ./run.sh

Prefer Python? → Read an ORC file in Python

See the schema without declaring one

Filter, aggregate, and group by

Columnar by stripe: only the columns you ask for get read

Override the inferred structure when you need to

Compressed ORC reads transparently

How fast is it on a real file?

The same SQL scales unchanged

Run it yourself

Subscribe to our newsletter

More like this

Best AlloyDB alternatives in 2026 for enterprise PostgreSQL workloads

ClickHouse concurrency: how to size for user-facing analytics

When to denormalize, when to join: A ClickHouse guide

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to read an ORC file from the command line

See the schema without declaring one #

Filter, aggregate, and group by #

Columnar by stripe: only the columns you ask for get read #

Override the inferred structure when you need to #

Compressed ORC reads transparently #

How fast is it on a real file? #

The same SQL scales unchanged #

Run it yourself #

Related #

Subscribe to our newsletter

More like this

Best AlloyDB alternatives in 2026 for enterprise PostgreSQL workloads

ClickHouse concurrency: how to size for user-facing analytics

When to denormalize, when to join: A ClickHouse guide

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

See the schema without declaring one

Filter, aggregate, and group by

Columnar by stripe: only the columns you ask for get read

Override the inferred structure when you need to

Compressed ORC reads transparently

How fast is it on a real file?

The same SQL scales unchanged

Run it yourself

Related