How to read an Avro file

Al Brown
Last updated: Jun 15, 2026

To read an Avro file from the command line and query its rows, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then query the file directly:

1clickhouse local -q "SELECT * FROM file('events.avro') LIMIT 10"
1   ┌─event_date─┬──────────────event_time─┬─event_id─┬─country─┬─event_type─┬─revenue─┬─quantity─┐
21.2026-01-012026-01-01 00:00:00.0001 │ GB      │ click      │       5132.2026-01-022026-01-01 00:02:17.0002 │ US      │ view6.01243.2026-01-032026-01-01 00:04:34.0003 │ DE      │ purchase   │    7.0235   └────────────┴─────────────────────────┴──────────┴─────────┴────────────┴─────────┴──────────┘

The format is detected from the .avro extension. Avro carries its schema inside the file, so the file is read in place with no import step and no structure to declare.

See the schema without declaring one #

Avro is self-describing: every file begins with a header that holds its writer schema as JSON. That is the key difference from a format like CSV, where types are guessed from the data. Here the types are read, not inferred, so there is nothing to declare. DESCRIBE prints them:

1clickhouse local -q "DESCRIBE file('events.avro') FORMAT PrettyCompact"
1   ┌─name───────┬─type──────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
21. │ event_date │ Date32        │              │                    │         │                  │                │
32. │ event_time │ DateTime64(3) │              │                    │         │                  │                │
43. │ event_id   │ Int32         │              │                    │         │                  │                │
54. │ country    │ String        │              │                    │         │                  │                │
65. │ event_type │ String        │              │                    │         │                  │                │
76. │ revenue    │ Float64       │              │                    │         │                  │                │
87. │ quantity   │ Int32         │              │                    │         │                  │                │
9   └────────────┴───────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

(The empty columns are CREATE TABLE metadata such as defaults, codecs and TTLs, which an Avro file doesn't carry, so they come back blank.)

Logical types: dates and timestamps stay typed #

Avro's base types are deliberately small: int, long, float, double, bytes, string, boolean. Dates and timestamps ride on top as logical types, which are annotations layered over a base type. A date is an int tagged date; a millisecond timestamp is a long tagged timestamp-millis. clickhouse-local reads those annotations, so the columns come back as real Date32 and DateTime64(3) rather than bare integers:

1clickhouse local -q "
2SELECT event_date, toTypeName(event_date) AS date_type,
3       event_time, toTypeName(event_time) AS time_type
4FROM file('events.avro') LIMIT 3 FORMAT PrettyCompact"
1   ┌─event_date─┬─date_type─┬──────────────event_time─┬─time_type─────┐
21. │ 2026-01-01 │ Date32    │ 2026-01-01 00:00:00.000 │ DateTime64(3) │
32. │ 2026-01-02 │ Date32    │ 2026-01-01 00:02:17.000 │ DateTime64(3) │
43. │ 2026-01-03 │ Date32    │ 2026-01-01 00:04:34.000 │ DateTime64(3) │
5   └────────────┴───────────┴─────────────────────────┴───────────────┘

This matters because it is the common failure mode of weaker Avro readers: a timestamp that arrives as a raw 1767225600 instead of a date you can filter on. Here you can write WHERE event_date >= '2026-06-01' or toStartOfMonth(event_time) against the file with no casting. If a producer wrote a timestamp as a plain long with no logical-type tag, ClickHouse reads it as the integer it is, and a toDateTime64() in your query restores it. (What is an Avro file?)

Filter, aggregate, and group by #

Because the file is a SQL source, the full ClickHouse dialect works on it (WHERE, GROUP BY, aggregate functions, window functions, joins), not a fixed table UI:

1clickhouse local -q "
2SELECT country,
3       count() AS events,
4       round(sum(revenue), 2) AS revenue,
5       round(avg(quantity), 2) AS avg_qty
6FROM file('events.avro')
7WHERE event_type = 'purchase'
8GROUP BY country
9ORDER BY revenue DESC
10FORMAT PrettyCompact"
1   ┌─country─┬─events─┬─revenue─┬─avg_qty─┐
21. │ GB      │      2 │   34.24 │       3 │
32. │ DE      │      2 │   26.16 │       4 │
43. │ IN      │      1 │    15.1 │       1 │
5   └─────────┴────────┴─────────┴─────────┘

Avro is row-oriented, so a GROUP BY reads the whole file rather than a few columns the way a columnar format would. That is fine for the everyday case. It is how Avro is built, and it is what makes it a good fit for streaming and append-heavy pipelines. clickhouse-local reads Avro with one binary that also reads CSV, JSON, ORC, Parquet and many other formats, and runs the same SQL unchanged when you move from a file to a server.

Write Avro back out #

The conversion runs the other way too. SELECT from any source and write Avro with FORMAT Avro, and ClickHouse emits the matching logical types automatically (a Date becomes date, a DateTime64(3) becomes timestamp-millis):

1clickhouse local -q "
2SELECT toDate('2026-01-01') AS event_date,
3       toDateTime64('2026-01-01 00:00:00.000', 3) AS event_time,
4       'GB' AS country
5INTO OUTFILE 'out.avro' TRUNCATE FORMAT Avro"

DESCRIBE file('out.avro') then shows event_date back as Date32 and event_time as DateTime64(3) — the annotations survive the round trip. That is exactly how the sample file in the Run it yourself folder is built.

How fast is it on a real file? #

Small files are instant in anything; the difference shows up at scale. On a 3,000,000-row, ~61 MB Avro file (events_large.avro, generated by the example folder below), a filter-and-group-by across every row runs in:

1clickhouse local --time -q "
2SELECT country, count(), round(sum(revenue), 2), round(avg(quantity), 3)
3FROM file('events_large.avro')
4WHERE event_type = 'purchase'
5GROUP BY country ORDER BY 2 DESC
6FORMAT Null"

~0.56 seconds, best of three with the file warm in the OS page cache, measured on an Apple-silicon laptop (Apple M4 Pro, 14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That includes decoding the Avro container from scratch on every run — there is no cached table. The query runs across all cores, so a multi-million-row aggregation finishes before you can switch windows.

The same SQL scales unchanged #

The query you just ran on a laptop file is the same SQL you would run on a ClickHouse server, or in ClickHouse Cloud. Nothing about SELECT ... WHERE ... GROUP BY changes. You swap file('events.avro') for a table name and the rest stays put. There is no separate "local dialect" to unlearn: you prototype against a file on your machine and ship the identical logic to production.

Run it yourself #

The complete, runnable example lives here. It has generate.sh (builds the demo Avro file and the ~61 MB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-avro

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-avro
3./generate.sh && ./run.sh
Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...