To read Avro files and Confluent-framed Avro messages from a schema registry, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then read a plain Avro file:
clickhouse local -q "SELECT * FROM file('events.avro') FORMAT PrettyCompact"
┌─event_date─┬─event_id─┬─event_type─┬─country─┬─user_id─┬─amount─┐
1. │ 2026-01-01 │ 1 │ login │ GB │ 100000 │ 0 │
2. │ 2026-01-02 │ 2 │ click │ US │ 100001 │ 0.01 │
3. │ 2026-01-03 │ 3 │ purchase │ DE │ 100002 │ 0.02 │
4. │ 2026-01-04 │ 4 │ refund │ FR │ 100003 │ 0.03 │
5. │ 2026-01-05 │ 5 │ logout │ IN │ 100004 │ 0.04 │
6. │ 2026-01-06 │ 6 │ login │ GB │ 100005 │ 0.05 │
7. │ 2026-01-07 │ 7 │ click │ US │ 100006 │ 0.06 │
8. │ 2026-01-08 │ 8 │ purchase │ DE │ 100007 │ 0.07 │
└────────────┴──────────┴────────────┴─────────┴─────────┴────────┘
A plain .avro file embeds its own schema in the header, so ClickHouse reads the column names and types from the file itself with no import step. For Confluent-framed messages, where each message carries only a schema id rather than the full schema, you switch to the AvroConfluent format and supply the registry URL; the rest of this guide covers that path.
The common case is a plain Avro Object Container File. It carries its own schema in the header, so you never write CREATE TABLE. Point file() at it and query:
clickhouse local -q "SELECT * FROM file('events.avro') FORMAT PrettyCompact"
┌─event_date─┬─event_id─┬─event_type─┬─country─┬─user_id─┬─amount─┐
1. │ 2026-01-01 │ 1 │ login │ GB │ 100000 │ 0 │
2. │ 2026-01-02 │ 2 │ click │ US │ 100001 │ 0.01 │
3. │ 2026-01-03 │ 3 │ purchase │ DE │ 100002 │ 0.02 │
4. │ 2026-01-04 │ 4 │ refund │ FR │ 100003 │ 0.03 │
5. │ 2026-01-05 │ 5 │ logout │ IN │ 100004 │ 0.04 │
6. │ 2026-01-06 │ 6 │ login │ GB │ 100005 │ 0.05 │
7. │ 2026-01-07 │ 7 │ click │ US │ 100006 │ 0.06 │
8. │ 2026-01-08 │ 8 │ purchase │ DE │ 100007 │ 0.07 │
└────────────┴──────────┴────────────┴─────────┴─────────┴────────┘
DESCRIBE prints the schema ClickHouse read straight from the file:
clickhouse local -q "DESCRIBE file('events.avro')"
event_date String
event_id Int64
event_type String
country String
user_id Int32
amount Float64
Because the file is a SQL source, the full dialect works on it. Filter, group, aggregate, join, no load step in between:
clickhouse local -q "
SELECT country, count() AS events, round(sum(amount), 2) AS total
FROM file('events.avro')
GROUP BY country
ORDER BY country
FORMAT PrettyCompact"
┌─country─┬─events─┬─total─┐
1. │ DE │ 2 │ 0.09 │
2. │ FR │ 1 │ 0.03 │
3. │ GB │ 2 │ 0.05 │
4. │ IN │ 1 │ 0.04 │
5. │ US │ 2 │ 0.07 │
└─────────┴────────┴───────┘
If all you have is a plain .avro on disk, you are done. The read an Avro file guide goes deeper on that path. The registry format below is a separate framing.
The gotcha: plain Avro and Confluent-framed Avro are not the same bytes
This is the one thing that trips people up. The two formats start with different magic bytes, and the engine checks them.
A plain Avro Object Container File begins with the ASCII magic Obj followed by 0x01:
clickhouse local -q "SELECT hex(substring(file('events.avro', 'RawBLOB'), 1, 4)) AS magic_hex"
4F 62 6A is Obj. A Confluent wire-format message has no such header. Each message begins with a single 0x00 magic byte, then a 4-byte big-endian schema id, then the raw Avro body. There is no embedded schema, only the id, which is why a registry is required to decode it.
So if you point the AvroConfluent format at a plain .avro file, it reads O (0x4F) where it expected 0x00 and stops immediately:
clickhouse local -q "
SELECT * FROM file('events.avro', 'AvroConfluent')
SETTINGS format_avro_schema_registry_url = 'http://localhost:8081'"
Code: 117. DB::Exception: Invalid magic byte before AvroConfluent schema
identifier. Must be zero byte, found 79 instead. (INCORRECT_DATA)
That error (79 is decimal for 0x4F) is the clearest signal you have the wrong framing: you fed an Object Container File to a reader expecting Confluent-framed bytes, or vice versa.
When the bytes really are Confluent-framed, for example a dump of Kafka message values produced with the Confluent Avro serializer, use AvroConfluent and give ClickHouse the registry URL. ClickHouse reads the schema id from each message, fetches the matching schema from the registry, and decodes the rows:
clickhouse local -q "
SELECT *
FROM file('messages.bin', 'AvroConfluent')
SETTINGS format_avro_schema_registry_url = 'http://schema-registry:8081'"
The same applies inside a ClickHouse server when you create a Kafka engine table reading an Avro topic: you set kafka_format = 'AvroConfluent' and the same format_avro_schema_registry_url setting, and ClickHouse resolves schemas the same way.
This step needs a reachable Schema Registry; there is no local stand-in for it, and that is by design, since the schema lives in the registry, not in the message. The example folder below proves the plain-Avro read end to end and reproduces the magic-byte error exactly; it does not fabricate registry responses.
Small files are instant in anything. The difference shows at scale. On a 3,000,000-row, ~62 MB plain Avro file (events_large.avro, generated by the example folder), a full GROUP BY country with sum and avg over every row completes in:
clickhouse local --time -q "
SELECT country, count() AS events, round(sum(amount),2) AS total, round(avg(amount),3) AS avg_amt
FROM file('events_large.avro')
GROUP BY country ORDER BY total DESC
FORMAT Null"
~0.53 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). Avro is row-oriented, so the whole record is decoded per row, yet the aggregation still runs across all cores and finishes before you can switch windows. If you query the data repeatedly, convert it to Parquet once and the columnar scans get faster again.
The query you ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Nothing about SELECT ... GROUP BY changes. You swap file('events.avro') for a Kafka engine table reading the same Avro topic, keep AvroConfluent and format_avro_schema_registry_url, and the rest stays put. You prototype against a file and ship the identical logic to production.
The complete, runnable example lives here. It has generate.sh (builds the demo file and the ~62 MB perf file), run.sh (every command above, including the reproduced magic-byte error), and expected_output.txt:
github.com/ClickHouse/examples → local-analytics/clickhouse-local-avro-confluent
git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-avro-confluent
./generate.sh && ./run.sh