To read an ORC file from the command line, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then point it at the file with the file() function:
clickhouse local -q "SELECT * FROM file('events.orc') LIMIT 10"
┌────────────────────event_time─┬─user_id─┬─country─┬─device──┬─event_type─┬─revenue─┬─quantity─┐
1. │ 2026-01-01 00:00:00.000000000 │ 1 │ GB │ desktop │ click │ 5 │ 1 │
2. │ 2026-01-01 00:02:17.000000000 │ 2 │ US │ mobile │ view │ 6.01 │ 2 │
3. │ 2026-01-01 00:04:34.000000000 │ 3 │ DE │ tablet │ purchase │ 7.02 │ 3 │
4. │ 2026-01-01 00:06:51.000000000 │ 4 │ FR │ desktop │ refund │ 8.03 │ 4 │
5. │ 2026-01-01 00:09:08.000000000 │ 5 │ IN │ mobile │ click │ 9.04 │ 5 │
└───────────────────────────────┴─────────┴─────────┴─────────┴────────────┴─────────┴──────────┘
ORC carries its own schema in a footer, so ClickHouse reads the column names and types straight from the file with no import step and no CREATE TABLE.
ORC stores its schema in a footer at the end of the file. ClickHouse reads it, so you never write CREATE TABLE. DESCRIBE prints the column names and the types that came back:
clickhouse local -q "DESCRIBE file('events.orc') FORMAT PrettyCompact"
┌─name───────┬─type────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
1. │ event_time │ Nullable(DateTime64(9)) │ │ │ │ │ │
2. │ user_id │ Nullable(Int64) │ │ │ │ │ │
3. │ country │ Nullable(String) │ │ │ │ │ │
4. │ device │ Nullable(String) │ │ │ │ │ │
5. │ event_type │ Nullable(String) │ │ │ │ │ │
6. │ revenue │ Nullable(Float64) │ │ │ │ │ │
7. │ quantity │ Nullable(Int8) │ │ │ │ │ │
└────────────┴─────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
Timestamps come back as DateTime64, integers keep their width, and the columns are Nullable because ORC tracks nullability per column. The empty default_type, codec_expression and ttl_expression columns are CREATE TABLE metadata that an ORC file doesn't carry, so they stay blank.
Filter, aggregate, and group by
A viewer shows you rows. A query engine answers questions. Because the file is just a SQL source, the full ClickHouse dialect works on it — WHERE, GROUP BY, aggregates, window functions, joins — with no load step in between:
clickhouse local -q "
SELECT country,
count() AS events,
round(sum(revenue), 2) AS revenue,
round(avg(quantity), 3) AS avg_qty
FROM file('events.orc')
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC
FORMAT PrettyCompact"
┌─country─┬─events─┬─revenue─┬─avg_qty─┐
1. │ GB │ 2 │ 34.24 │ 3 │
2. │ DE │ 2 │ 26.16 │ 4 │
3. │ IN │ 1 │ 15.1 │ 1 │
└─────────┴────────┴─────────┴─────────┘
Columnar by stripe: only the columns you ask for get read
ORC lays data out by column, not by row, and groups rows into stripes (typically tens of thousands of rows each). Each stripe holds the column data plus an index with per-column statistics like min and max. Two things follow from that layout, and they're the reason ORC is a query format rather than a transport format:
- Column pruning. A query that touches two columns reads only those two off disk. The other five are never decoded. The aggregation below scans 3,000,000 rows but only ever looks at
country and revenue:
clickhouse local -q "
SELECT country, round(sum(revenue), 2) AS rev
FROM file('events_large.orc')
GROUP BY country
ORDER BY rev DESC
LIMIT 5
FORMAT PrettyCompact"
┌─country─┬─────────rev─┐
1. │ JP │ 75125313.31 │
2. │ AU │ 75114924.84 │
3. │ CA │ 75095320.01 │
4. │ US │ 75066382.72 │
5. │ DE │ 74978831.54 │
└─────────┴─────────────┘
- Stripe skipping. When a
WHERE clause filters on a column, ClickHouse can use the per-stripe min/max statistics to skip whole stripes whose range can't match, without decoding them.
This is the same columnar model as Parquet: ORC's stripes play the role Parquet's row groups do, and both keep per-chunk statistics for skipping. The formats are close cousins, and clickhouse-local reads both with the identical file() call. ORC tends to show up in Hive and Spark-era data lakes; Parquet is the more common interchange format today. If you have a choice and the data is moving between tools, Parquet is usually the safer default, but ORC reads just as directly here.
Inference is right almost always, because the footer is authoritative. When you want to read just a subset of columns, or pin a type, pass the ORC format and an explicit structure as the second and third arguments to file():
clickhouse local -q "
SELECT country, count() AS events
FROM file('events.orc', 'ORC', 'country String, revenue Float64')
GROUP BY country ORDER BY country
FORMAT PrettyCompact"
┌─country─┬─events─┐
1. │ AU │ 3 │
2. │ DE │ 3 │
3. │ FR │ 3 │
4. │ GB │ 4 │
5. │ IN │ 3 │
6. │ US │ 4 │
└─────────┴────────┘
Naming only the columns you list keeps the read narrow even if the file has dozens of columns.
ORC compresses its stripes internally (Zlib, Zstd, LZ4, Snappy), and ClickHouse reads the codec from the footer, so there's nothing to pass. An ORC file wrapped in an outer .orc.gz is also decompressed on the fly, with the extension telling ClickHouse what to do:
clickhouse local -q "SELECT count() FROM file('events.orc.gz') FORMAT PrettyCompact"
┌─count()─┐
1. │ 20 │
└─────────┘
Small files are instant in anything; the difference shows at scale. On a 3,000,000-row, ~28 MB ORC file (events_large.orc, generated by the example folder below), the filter-and-group-by query runs in:
clickhouse local --time -q "
SELECT country, count(), round(sum(revenue), 2), round(avg(quantity), 3)
FROM file('events_large.orc')
WHERE event_type = 'purchase'
GROUP BY country ORDER BY 3 DESC
FORMAT Null"
~0.28 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). Column pruning means the query reads only the columns it touches, and it runs across all cores.
The query you just ran on a laptop file is the same SQL you'd run on a ClickHouse server, or in ClickHouse Cloud. Nothing about SELECT ... WHERE ... GROUP BY changes. Swap file('events.orc') for a table name, or for an s3() function pointing at a bucket of ORC files, and the rest stays put. You prototype against a file on your machine and ship the identical logic to production.
The complete, runnable example lives here. It has generate.sh (builds the demo file and the 3M-row perf file), run.sh (every command above), and expected_output.txt:
github.com/ClickHouse/examples → local-analytics/clickhouse-local-orc
git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-orc
./generate.sh && ./run.sh