Read Avro from a schema registry with clickhouse-local

Al Brown
Last updated: Jun 15, 2026

To read Avro files and Confluent-framed Avro messages from a schema registry, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then read a plain Avro file:

clickhouse local -q "SELECT * FROM file('events.avro') FORMAT PrettyCompact"
┌─event_date─┬─event_id─┬─event_type─┬─country─┬─user_id─┬─amount─┐
1. │ 2026-01-01 │        1 │ login      │ GB      │  100000 │      0 │
2. │ 2026-01-02 │        2 │ click      │ US      │  100001 │   0.01 │
3. │ 2026-01-03 │        3 │ purchase   │ DE      │  100002 │   0.02 │
4. │ 2026-01-04 │        4 │ refund     │ FR      │  100003 │   0.03 │
5. │ 2026-01-05 │        5 │ logout     │ IN      │  100004 │   0.04 │
6. │ 2026-01-06 │        6 │ login      │ GB      │  100005 │   0.05 │
7. │ 2026-01-07 │        7 │ click      │ US      │  100006 │   0.06 │
8. │ 2026-01-08 │        8 │ purchase   │ DE      │  100007 │   0.07 │
   └────────────┴──────────┴────────────┴─────────┴─────────┴────────┘

A plain .avro file embeds its own schema in the header, so ClickHouse reads the column names and types from the file itself with no import step. For Confluent-framed messages, where each message carries only a schema id rather than the full schema, you switch to the AvroConfluent format and supply the registry URL; the rest of this guide covers that path.

Read a plain .avro file with no schema declaration

The common case is a plain Avro Object Container File. It carries its own schema in the header, so you never write CREATE TABLE. Point file() at it and query:

clickhouse local -q "SELECT * FROM file('events.avro') FORMAT PrettyCompact"
┌─event_date─┬─event_id─┬─event_type─┬─country─┬─user_id─┬─amount─┐
1. │ 2026-01-01 │        1 │ login      │ GB      │  100000 │      0 │
2. │ 2026-01-02 │        2 │ click      │ US      │  100001 │   0.01 │
3. │ 2026-01-03 │        3 │ purchase   │ DE      │  100002 │   0.02 │
4. │ 2026-01-04 │        4 │ refund     │ FR      │  100003 │   0.03 │
5. │ 2026-01-05 │        5 │ logout     │ IN      │  100004 │   0.04 │
6. │ 2026-01-06 │        6 │ login      │ GB      │  100005 │   0.05 │
7. │ 2026-01-07 │        7 │ click      │ US      │  100006 │   0.06 │
8. │ 2026-01-08 │        8 │ purchase   │ DE      │  100007 │   0.07 │
   └────────────┴──────────┴────────────┴─────────┴─────────┴────────┘

DESCRIBE prints the schema ClickHouse read straight from the file:

clickhouse local -q "DESCRIBE file('events.avro')"
event_date	String
event_id	Int64
event_type	String
country	String
user_id	Int32
amount	Float64

Because the file is a SQL source, the full dialect works on it. Filter, group, aggregate, join, no load step in between:

clickhouse local -q "
SELECT country, count() AS events, round(sum(amount), 2) AS total
FROM file('events.avro')
GROUP BY country
ORDER BY country
FORMAT PrettyCompact"
┌─country─┬─events─┬─total─┐
1. │ DE      │      2 │  0.09 │
2. │ FR      │      1 │  0.03 │
3. │ GB      │      2 │  0.05 │
4. │ IN      │      1 │  0.04 │
5. │ US      │      2 │  0.07 │
   └─────────┴────────┴───────┘

If all you have is a plain .avro on disk, you are done. The read an Avro file guide goes deeper on that path. The registry format below is a separate framing.

The gotcha: plain Avro and Confluent-framed Avro are not the same bytes

This is the one thing that trips people up. The two formats start with different magic bytes, and the engine checks them.

A plain Avro Object Container File begins with the ASCII magic Obj followed by 0x01:

clickhouse local -q "SELECT hex(substring(file('events.avro', 'RawBLOB'), 1, 4)) AS magic_hex"
4F626A01

4F 62 6A is Obj. A Confluent wire-format message has no such header. Each message begins with a single 0x00 magic byte, then a 4-byte big-endian schema id, then the raw Avro body. There is no embedded schema, only the id, which is why a registry is required to decode it.

So if you point the AvroConfluent format at a plain .avro file, it reads O (0x4F) where it expected 0x00 and stops immediately:

clickhouse local -q "
SELECT * FROM file('events.avro', 'AvroConfluent')
SETTINGS format_avro_schema_registry_url = 'http://localhost:8081'"
Code: 117. DB::Exception: Invalid magic byte before AvroConfluent schema
identifier. Must be zero byte, found 79 instead. (INCORRECT_DATA)

That error (79 is decimal for 0x4F) is the clearest signal you have the wrong framing: you fed an Object Container File to a reader expecting Confluent-framed bytes, or vice versa.

Read Confluent-framed Avro from the registry

When the bytes really are Confluent-framed, for example a dump of Kafka message values produced with the Confluent Avro serializer, use AvroConfluent and give ClickHouse the registry URL. ClickHouse reads the schema id from each message, fetches the matching schema from the registry, and decodes the rows:

clickhouse local -q "
SELECT *
FROM file('messages.bin', 'AvroConfluent')
SETTINGS format_avro_schema_registry_url = 'http://schema-registry:8081'"

The same applies inside a ClickHouse server when you create a Kafka engine table reading an Avro topic: you set kafka_format = 'AvroConfluent' and the same format_avro_schema_registry_url setting, and ClickHouse resolves schemas the same way.

This step needs a reachable Schema Registry; there is no local stand-in for it, and that is by design, since the schema lives in the registry, not in the message. The example folder below proves the plain-Avro read end to end and reproduces the magic-byte error exactly; it does not fabricate registry responses.

How fast is it on a real file?

Small files are instant in anything. The difference shows at scale. On a 3,000,000-row, ~62 MB plain Avro file (events_large.avro, generated by the example folder), a full GROUP BY country with sum and avg over every row completes in:

clickhouse local --time -q "
SELECT country, count() AS events, round(sum(amount),2) AS total, round(avg(amount),3) AS avg_amt
FROM file('events_large.avro')
GROUP BY country ORDER BY total DESC
FORMAT Null"

~0.53 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). Avro is row-oriented, so the whole record is decoded per row, yet the aggregation still runs across all cores and finishes before you can switch windows. If you query the data repeatedly, convert it to Parquet once and the columnar scans get faster again.

The same SQL scales unchanged

The query you ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Nothing about SELECT ... GROUP BY changes. You swap file('events.avro') for a Kafka engine table reading the same Avro topic, keep AvroConfluent and format_avro_schema_registry_url, and the rest stays put. You prototype against a file and ship the identical logic to production.

Run it yourself

The complete, runnable example lives here. It has generate.sh (builds the demo file and the ~62 MB perf file), run.sh (every command above, including the reproduced magic-byte error), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-avro-confluent

git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-avro-confluent
./generate.sh && ./run.sh

Share this resource

  • Y Combinator icon
  • X icon
  • Bluesky icon
  • Facebook icon
  • LinkedIn icon

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!

More like this

Scale vector search in Postgres with pgvector: avoid HNSW RAM limits, fix filtering recall, and know when to go hybrid. Read now.

Continue reading ->

How to query a REST API in Python

Al Brown • Last updated: Jun 15, 2026

Read a JSON API response into a DataFrame with chDB. Use the pandas API you already know to filter and aggregate the response, running on ClickHouse's engine with no server to start.

Continue reading ->

How to convert Parquet to ORC

Al Brown • Last updated: Jun 6, 2026

Convert a Parquet file to ORC with one clickhouse-local command. The schema is read from the Parquet footer and the types carry into ORC, with no server and no upload.

Continue reading ->