Read Avro from a schema registry with clickhouse-local

To read Avro files and Confluent-framed Avro messages from a schema registry, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then read a plain Avro file:

1clickhouse local -q "SELECT * FROM file('events.avro') FORMAT PrettyCompact"

1┌─event_date─┬─event_id─┬─event_type─┬─country─┬─user_id─┬─amount─┐
21. │ 2026-01-01 │        1 │ login      │ GB      │  100000 │      0 │
32. │ 2026-01-02 │        2 │ click      │ US      │  100001 │   0.01 │
43. │ 2026-01-03 │        3 │ purchase   │ DE      │  100002 │   0.02 │
54. │ 2026-01-04 │        4 │ refund     │ FR      │  100003 │   0.03 │
65. │ 2026-01-05 │        5 │ logout     │ IN      │  100004 │   0.04 │
76. │ 2026-01-06 │        6 │ login      │ GB      │  100005 │   0.05 │
87. │ 2026-01-07 │        7 │ click      │ US      │  100006 │   0.06 │
98. │ 2026-01-08 │        8 │ purchase   │ DE      │  100007 │   0.07 │
10   └────────────┴──────────┴────────────┴─────────┴─────────┴────────┘

A plain .avro file embeds its own schema in the header, so ClickHouse reads the column names and types from the file itself with no import step. For Confluent-framed messages, where each message carries only a schema id rather than the full schema, you switch to the AvroConfluent format and supply the registry URL; the rest of this guide covers that path.

Read a plain .avro file with no schema declaration

The common case is a plain Avro Object Container File. It carries its own schema in the header, so you never write CREATE TABLE. Point file() at it and query:

1clickhouse local -q "SELECT * FROM file('events.avro') FORMAT PrettyCompact"

1┌─event_date─┬─event_id─┬─event_type─┬─country─┬─user_id─┬─amount─┐
21. │ 2026-01-01 │        1 │ login      │ GB      │  100000 │      0 │
32. │ 2026-01-02 │        2 │ click      │ US      │  100001 │   0.01 │
43. │ 2026-01-03 │        3 │ purchase   │ DE      │  100002 │   0.02 │
54. │ 2026-01-04 │        4 │ refund     │ FR      │  100003 │   0.03 │
65. │ 2026-01-05 │        5 │ logout     │ IN      │  100004 │   0.04 │
76. │ 2026-01-06 │        6 │ login      │ GB      │  100005 │   0.05 │
87. │ 2026-01-07 │        7 │ click      │ US      │  100006 │   0.06 │
98. │ 2026-01-08 │        8 │ purchase   │ DE      │  100007 │   0.07 │
10   └────────────┴──────────┴────────────┴─────────┴─────────┴────────┘

DESCRIBE prints the schema ClickHouse read straight from the file:

1clickhouse local -q "DESCRIBE file('events.avro')"

1event_date	String
2event_id	Int64
3event_type	String
4country	String
5user_id	Int32
6amount	Float64

Because the file is a SQL source, the full dialect works on it. Filter, group, aggregate, join, no load step in between:

1clickhouse local -q "
2SELECT country, count() AS events, round(sum(amount), 2) AS total
3FROM file('events.avro')
4GROUP BY country
5ORDER BY country
6FORMAT PrettyCompact"

1┌─country─┬─events─┬─total─┐
21. │ DE      │      2 │  0.09 │
32. │ FR      │      1 │  0.03 │
43. │ GB      │      2 │  0.05 │
54. │ IN      │      1 │  0.04 │
65. │ US      │      2 │  0.07 │
7   └─────────┴────────┴───────┘

If all you have is a plain .avro on disk, you are done. The read an Avro file guide goes deeper on that path. The registry format below is a separate framing.

The gotcha: plain Avro and Confluent-framed Avro are not the same bytes

This is the one thing that trips people up. The two formats start with different magic bytes, and the engine checks them.

A plain Avro Object Container File begins with the ASCII magic Obj followed by 0x01:

1clickhouse local -q "SELECT hex(substring(file('events.avro', 'RawBLOB'), 1, 4)) AS magic_hex"

14F626A01

4F 62 6A is Obj. A Confluent wire-format message has no such header. Each message begins with a single 0x00 magic byte, then a 4-byte big-endian schema id, then the raw Avro body. There is no embedded schema, only the id, which is why a registry is required to decode it.

So if you point the AvroConfluent format at a plain .avro file, it reads O (0x4F) where it expected 0x00 and stops immediately:

1clickhouse local -q "
2SELECT * FROM file('events.avro', 'AvroConfluent')
3SETTINGS format_avro_schema_registry_url = 'http://localhost:8081'"

1Code: 117. DB::Exception: Invalid magic byte before AvroConfluent schema
2identifier. Must be zero byte, found 79 instead. (INCORRECT_DATA)

That error (79 is decimal for 0x4F) is the clearest signal you have the wrong framing: you fed an Object Container File to a reader expecting Confluent-framed bytes, or vice versa.

Read Confluent-framed Avro from the registry

When the bytes really are Confluent-framed, for example a dump of Kafka message values produced with the Confluent Avro serializer, use AvroConfluent and give ClickHouse the registry URL. ClickHouse reads the schema id from each message, fetches the matching schema from the registry, and decodes the rows:

1clickhouse local -q "
2SELECT *
3FROM file('messages.bin', 'AvroConfluent')
4SETTINGS format_avro_schema_registry_url = 'http://schema-registry:8081'"

The same applies inside a ClickHouse server when you create a Kafka engine table reading an Avro topic: you set kafka_format = 'AvroConfluent' and the same format_avro_schema_registry_url setting, and ClickHouse resolves schemas the same way.

This step needs a reachable Schema Registry; there is no local stand-in for it, and that is by design, since the schema lives in the registry, not in the message. The example folder below proves the plain-Avro read end to end and reproduces the magic-byte error exactly; it does not fabricate registry responses.

How fast is it on a real file?

Small files are instant in anything. The difference shows at scale. On a 3,000,000-row, ~62 MB plain Avro file (events_large.avro, generated by the example folder), a full GROUP BY country with sum and avg over every row completes in:

1clickhouse local --time -q "
2SELECT country, count() AS events, round(sum(amount),2) AS total, round(avg(amount),3) AS avg_amt
3FROM file('events_large.avro')
4GROUP BY country ORDER BY total DESC
5FORMAT Null"

~0.53 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). Avro is row-oriented, so the whole record is decoded per row, yet the aggregation still runs across all cores and finishes before you can switch windows. If you query the data repeatedly, convert it to Parquet once and the columnar scans get faster again.

The same SQL scales unchanged

The query you ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Nothing about SELECT ... GROUP BY changes. You swap file('events.avro') for a Kafka engine table reading the same Avro topic, keep AvroConfluent and format_avro_schema_registry_url, and the rest stays put. You prototype against a file and ship the identical logic to production.

Run it yourself

The complete, runnable example lives here. It has generate.sh (builds the demo file and the ~62 MB perf file), run.sh (every command above, including the reproduced magic-byte error), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-avro-confluent

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-avro-confluent
3./generate.sh && ./run.sh

Prefer Python? → Read an Avro file in Python

Read a plain .avro file with no schema declaration

The gotcha: plain Avro and Confluent-framed Avro are not the same bytes

Read Confluent-framed Avro from the registry

How fast is it on a real file?

The same SQL scales unchanged

Run it yourself

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

Read Avro from a schema registry with clickhouse-local

Read a plain .avro file with no schema declaration #

The gotcha: plain Avro and Confluent-framed Avro are not the same bytes #

Read Confluent-framed Avro from the registry #

How fast is it on a real file? #

The same SQL scales unchanged #

Run it yourself #

Related #

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

Read a plain .avro file with no schema declaration

The gotcha: plain Avro and Confluent-framed Avro are not the same bytes

Read Confluent-framed Avro from the registry

How fast is it on a real file?

The same SQL scales unchanged

Run it yourself

Related