What is an Avro file?

Al Brown
Last updated: Jun 15, 2026

An Avro file is a row-oriented, binary data format that stores its own schema, written as JSON, in the file header. Each file is self-describing: the schema travels with the data, so any reader can decode the records without an external definition. Avro was built for schema evolution, which makes it a common choice for streaming and data-pipeline payloads where the shape of a record changes over time.

At a glance

| Property | Avro |
| --- | --- |
| Layout | Row-oriented (records stored whole, one after another) |
| Schema | Embedded in the file header as JSON (self-describing) |
| Compression | Per-block: deflate, snappy, zstd (codec recorded in the header) |
| Schema evolution | First-class: add, remove or rename fields with defaults/aliases |
| Typical use | Streaming, message queues, data-pipeline records, RPC |
| Open standard | Apache Avro; readable by most data tools |

Internal structure

An Avro object container file has three parts:

  • Header: the four magic bytes Obj\x01, then file metadata as key/value pairs. Two keys matter: avro.schema (the full record schema as JSON) and avro.codec (the block compression codec). A 16-byte sync marker closes the header.
  • Data blocks: the records, grouped into blocks. Each block stores its row count, its compressed size, the encoded rows, then the sync marker. Within a block, every record is serialised field-by-field in schema order.
  • Sync markers: the random 16-byte value repeated between blocks. A reader that seeks into the middle of a large file scans forward to the next marker to find a block boundary, which makes Avro splittable for parallel processing.
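The three parts above can be sketched in plain Python. This is a minimal illustration, not a full Avro reader: it builds a tiny container file in memory (one `long` field, the `null` codec, and a fixed stand-in sync marker where a real writer would use random bytes), then parses the header and locates the next block boundary. The helper names (`read_header`, `next_block_boundary`, `build_sample`) are made up for this example.

```python
import io
import json

MAGIC = b"Obj\x01"
SYNC_SIZE = 16


def encode_long(n: int) -> bytes:
    """Avro long/int: zigzag-map the sign, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf: io.BytesIO) -> int:
    z, shift = 0, 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("truncated varint")
        z |= (chunk[0] & 0x7F) << shift
        if chunk[0] < 0x80:
            return (z >> 1) ^ -(z & 1)  # undo zigzag
        shift += 7


def read_bytes(buf: io.BytesIO) -> bytes:
    return buf.read(read_long(buf))


def read_header(buf: io.BytesIO):
    """Return (metadata dict, sync marker) from the start of a container file."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:  # the metadata map is written as counted blocks, ended by a 0 count
        count = read_long(buf)
        if count == 0:
            break
        if count < 0:  # negative count: a byte size follows, which we skip
            read_long(buf)
            count = -count
        for _ in range(count):
            key = read_bytes(buf).decode("utf-8")
            meta[key] = read_bytes(buf)
    return meta, buf.read(SYNC_SIZE)


def next_block_boundary(data: bytes, sync: bytes, offset: int) -> int:
    """Position just past the next sync marker at or after offset, or -1."""
    pos = data.find(sync, offset)
    return -1 if pos < 0 else pos + len(sync)


def build_sample() -> bytes:
    schema = {"type": "record", "name": "row",
              "fields": [{"name": "id", "type": "long"}]}
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    sync = bytes(range(16))  # stand-in; real writers use 16 random bytes
    out = bytearray(MAGIC)
    out += encode_long(len(meta))
    for key, value in meta.items():
        k = key.encode()
        out += encode_long(len(k)) + k + encode_long(len(value)) + value
    out += encode_long(0) + sync
    payload = b"".join(encode_long(i) for i in range(3))  # three records
    out += encode_long(3) + encode_long(len(payload)) + payload + sync
    return bytes(out)


data = build_sample()
buf = io.BytesIO(data)
meta, sync = read_header(buf)
header_end = buf.tell()

print(meta["avro.codec"].decode())
print(json.loads(meta["avro.schema"])["fields"])
print(next_block_boundary(data, sync, header_end) == len(data))
```

A reader that lands at an arbitrary offset can call `next_block_boundary` with that offset to resynchronise, which is the property that makes the format splittable.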

The defining trait is the embedded schema. Describing it is one thing; reading it out of a real file is more convincing. So let's write an Avro file, then crack open its header.

clickhouse local can do both. It's part of ClickHouse; install it with clickhousectl, the ClickHouse CLI for local and cloud:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Write a small event file, events_v1.avro:

clickhouse local -q "
SELECT
    number AS id,
    toDateTime('2026-01-01 00:00:00') + toIntervalMinute(number) AS event_time,
    ['GB','AU','IN','US','DE'][(number % 5) + 1] AS country,
    round(randUniform(1, 500), 2) AS revenue
FROM numbers(1000)
INTO OUTFILE 'events_v1.avro'
FORMAT Avro"

Because the schema is embedded, you can read the file straight back with no structure argument. DESCRIBE reads the header and reports the columns it found:

clickhouse local -q "DESCRIBE file('events_v1.avro')"

id	Int64
event_time	Int32
country	String
revenue	Float64

The embedded schema, read from the header

That column list came from the JSON schema sitting in the file header. We can pull it out verbatim. Read the raw bytes, match the avro.schema record JSON with a regular expression, and pretty-print it:

clickhouse local -q "
SELECT extractGroups(raw, '(\{\"type\":\"record\".*?\}\]\})')[1] AS avro_schema
FROM file('events_v1.avro', RawBLOB, 'raw String')
FORMAT TSVRaw" | python3 -m json.tool

{
    "type": "record",
    "name": "row",
    "fields": [
        {
            "name": "id",
            "type": "long"
        },
        {
            "name": "event_time",
            "type": "int"
        },
        {
            "name": "country",
            "type": "string"
        },
        {
            "name": "revenue",
            "type": "double"
        }
    ]
}

This is the actual definition stored in the file, not an inference. The Avro types (long, int, string, double) are what ClickHouse mapped to Int64, Int32, String and Float64 in the DESCRIBE above. Every Avro file carries this; it is what makes the format self-describing.
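To make those four types concrete, here is a small sketch of how one record with this schema would be laid out in Avro's binary encoding: `long` and `int` as zigzag varints, `string` as a length prefix plus UTF-8 bytes, `double` as 8 little-endian bytes, all written in schema order with no field names or separators. The sample values are invented for illustration, and the function names are not from any Avro library.

```python
import struct


def encode_long(n: int) -> bytes:
    """Zigzag varint, used by Avro for both int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw  # byte length, then the bytes


def encode_double(x: float) -> bytes:
    return struct.pack("<d", x)  # IEEE 754, little-endian


# Field order mirrors the embedded schema; order is all the decoder has to go on.
FIELDS = [("id", "long"), ("event_time", "int"),
          ("country", "string"), ("revenue", "double")]
ENCODERS = {"long": encode_long, "int": encode_long,
            "string": encode_string, "double": encode_double}


def encode_record(record: dict) -> bytes:
    return b"".join(ENCODERS[kind](record[name]) for name, kind in FIELDS)


# Hypothetical row: the values are made up, not read from events_v1.avro.
record = {"id": 3, "event_time": 1767225780, "country": "US", "revenue": 12.5}
encoded = encode_record(record)
print(len(encoded), encoded.hex(" "))
```

Because nothing in the bytes names a field, the same 17 bytes are meaningless without the schema, which is why the header carries it.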

Compression and encoding

Avro compresses per data block, and records the codec in the header under avro.codec (the file above uses snappy by default). Because compression is applied to whole rows rather than to columns of one type, an Avro file typically does not pack as tightly as a columnar format. The trade is that a row-oriented layout is cheap to append to and cheap to read one whole record at a time, which suits streaming producers and consumers.
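As a rough sketch of per-block compression, the snippet below frames one data block the way the container format describes it: record count, compressed byte size, the compressed rows, then the sync marker. It uses the deflate codec (a raw DEFLATE stream, which Python's `zlib` produces with a negative `wbits`) because it needs no third-party package; snappy and zstd would follow the same framing. The sync marker and the row contents are stand-ins chosen for the example.

```python
import io
import zlib

SYNC = bytes(range(16))  # stand-in; a real writer picks 16 random bytes per file


def encode_long(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf: io.BytesIO) -> int:
    z, shift = 0, 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("truncated varint")
        z |= (chunk[0] & 0x7F) << shift
        if chunk[0] < 0x80:
            return (z >> 1) ^ -(z & 1)
        shift += 7


def write_block(rows: list, sync: bytes = SYNC) -> bytes:
    payload = b"".join(rows)  # whole rows, back to back
    deflater = zlib.compressobj(wbits=-15)  # negative wbits: raw DEFLATE, no zlib header
    compressed = deflater.compress(payload) + deflater.flush()
    return encode_long(len(rows)) + encode_long(len(compressed)) + compressed + sync


def read_block(buf: io.BytesIO, sync: bytes = SYNC):
    count = read_long(buf)
    size = read_long(buf)
    payload = zlib.decompress(buf.read(size), -15)
    if buf.read(len(sync)) != sync:
        raise ValueError("sync marker mismatch: file is corrupt or misaligned")
    return count, payload


countries = [b"GB", b"AU", b"IN", b"US", b"DE"]
# Each row: an id (long) followed by a 2-byte country string.
rows = [encode_long(i) + encode_long(2) + countries[i % 5] for i in range(1000)]

block = write_block(rows)
count, payload = read_block(io.BytesIO(block))
print(count, len(payload), len(block))
```

The compressor only ever sees interleaved fields from whole rows, which is the reason given above for Avro not packing as tightly as a columnar layout.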

Querying the file is plain SQL. Read it directly and aggregate:

clickhouse local -q "
SELECT country, count() AS events, round(sum(revenue)) AS revenue
FROM file('events_v1.avro')
GROUP BY country
ORDER BY revenue DESC
FORMAT Pretty"

   ┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
   ┃ country ┃ events ┃ revenue ┃
   ┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
1. │ DE      │    200 │   53729 │
   ├─────────┼────────┼─────────┤
2. │ GB      │    200 │   51090 │
   ├─────────┼────────┼─────────┤
3. │ AU      │    200 │   50394 │
   ├─────────┼────────┼─────────┤
4. │ IN      │    200 │   50181 │
   ├─────────┼────────┼─────────┤
5. │ US      │    200 │   49363 │
   └─────────┴────────┴─────────┘

On a larger 3,000,000-row file (42 MB on disk), scanning and aggregating by country runs in 0.270s (best of three, warm cache, Apple M4 Pro, 24 GB RAM, macOS; the figure may be nudged by other work on the machine).

Schema evolution, the reason Avro exists

A record's shape changes over time. Avro is designed for that: a writer can add a field, and a reader whose newer schema gives that field a default still decodes older records. To see the schema actually change, take a v2 file, events_v2.avro, written the same way with an added channel field, and read its embedded schema:

clickhouse local -q "
SELECT extractGroups(raw, '(\{\"type\":\"record\".*?\}\]\})')[1] AS avro_schema
FROM file('events_v2.avro', RawBLOB, 'raw String')
FORMAT TSVRaw" | python3 -m json.tool

{
    "type": "record",
    "name": "row",
    "fields": [
        {
            "name": "id",
            "type": "long"
        },
        {
            "name": "event_time",
            "type": "int"
        },
        {
            "name": "country",
            "type": "string"
        },
        {
            "name": "revenue",
            "type": "double"
        },
        {
            "name": "channel",
            "type": "string"
        }
    ]
}

The header now lists five fields. Each Avro file is read against its own embedded schema, so the new file decodes cleanly with the extra column. Aggregating events_v2.avro by channel returns:

   ┏━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┓
   ┃ channel  ┃ events ┃ revenue ┃
   ┡━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━┩
1. │ organic  │    334 │   83430 │
   ├──────────┼────────┼─────────┤
2. │ paid     │    333 │   85806 │
   ├──────────┼────────┼─────────┤
3. │ referral │    333 │   81942 │
   └──────────┴────────┴─────────┘

One practical note when reading many Avro files at once: a query that references channel across both v1 and v2 files will fail on the older file, because each file is decoded against the schema in its own header and the v1 header has no channel field. That strictness is the point. In Avro, evolution is explicit rather than guessed: the reading side supplies a schema in which new fields carry defaults and renamed fields carry aliases, and records written under the older schema are resolved against it.
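The resolution rule can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the algorithm from any Avro library: it takes a record already decoded with the writer's (v1) schema and projects it onto a reader's (v2) field list, using an alias or a default when the writer never wrote that field. The `"unknown"` default and the `region` alias are hypothetical values chosen for the example, and type promotion is left out.

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Project a record decoded with the writer's schema onto the reader's fields."""
    resolved = {}
    for field in reader_fields:
        # A reader field matches a writer field by its name or by any alias.
        candidates = [field["name"], *field.get("aliases", [])]
        source = next((name for name in candidates if name in record), None)
        if source is not None:
            resolved[field["name"]] = record[source]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise KeyError(f"no value or default for field {field['name']!r}")
    return resolved


# Reader schema: the v2 fields, with a hypothetical default for the new one.
READER_V2 = [
    {"name": "id", "type": "long"},
    {"name": "event_time", "type": "int"},
    {"name": "country", "type": "string"},
    {"name": "revenue", "type": "double"},
    {"name": "channel", "type": "string", "default": "unknown"},
]

# A record as it would come out of a v1 file (values invented).
v1_record = {"id": 1, "event_time": 1767225660, "country": "AU", "revenue": 12.5}

print(resolve(v1_record, READER_V2))

# Renames work the same way: the reader names the old field as an alias.
RENAMED = [{"name": "region", "type": "string", "aliases": ["country"]}]
print(resolve(v1_record, RENAMED))
```

Dropping the `default` from the `channel` field makes `resolve` raise, which mirrors the failure described above when an older file has no value to offer and nothing says what to use instead.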

How it compares to Parquet

Avro and Parquet are both binary, self-describing and open, and they are often used together: Avro for the row-by-row pipeline, Parquet for the columnar store the rows land in. The difference is layout. Avro is row-oriented, so it writes and reads whole records cheaply, which suits streaming. Parquet is columnar, so an analytical query reads only the columns it needs and skips the rest, which makes it faster to scan and smaller on disk for the same data. Choose Avro for moving records; choose Parquet for querying them at rest. When you need to move from one to the other, see convert Avro to Parquet.
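The layout difference can be seen with a toy model. The sketch below is neither Avro nor Parquet: it packs the same 1,000 records into fixed-width bytes twice, once row by row and once column by column, and counts how many bytes a sum over revenue has to walk through in each case. The fixed-width record (`<qd2s`) is an assumption made to keep the arithmetic obvious; real Avro rows are variable-width, so a reader has to decode the earlier fields of each row to reach a later one.

```python
import struct

ROW = struct.Struct("<qd2s")  # id (int64), revenue (float64), country (2 bytes)
COUNTRIES = [b"GB", b"AU", b"IN", b"US", b"DE"]

n = 1000
ids = list(range(n))
revenues = [i * 1.5 for i in range(n)]
countries = [COUNTRIES[i % 5] for i in range(n)]

# Row-oriented: each record stored whole, one after another.
row_layout = b"".join(ROW.pack(*rec) for rec in zip(ids, revenues, countries))

# Column-oriented: all ids, then all revenues, then all countries.
col_layout = (struct.pack(f"<{n}q", *ids)
              + struct.pack(f"<{n}d", *revenues)
              + b"".join(countries))

# Summing revenue from rows means walking every record in full.
total_from_rows = sum(rev for _id, rev, _country in ROW.iter_unpack(row_layout))
bytes_walked_rows = len(row_layout)

# Summing revenue from columns reads one contiguous slice and skips the rest.
revenue_offset = 8 * n
total_from_cols = sum(struct.unpack_from(f"<{n}d", col_layout, revenue_offset))
bytes_walked_cols = 8 * n

print(total_from_rows == total_from_cols)
print(bytes_walked_rows, bytes_walked_cols)
```

Appending one more record is the mirror image: the row layout grows by a single contiguous write, while the columnar layout has to touch every column.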

For other row-oriented binary formats, see what is BSON and what is MessagePack.

Read one yourself

You don't need a server or an import step to work with Avro. clickhouse local reads an Avro file in place from the command line, using the schema in the file's own header. ClickHouse uses the same SQL whether you point it at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local Avro file scales up unchanged.

For the full walkthrough, covering globbing many files, projecting columns and casting types, see how to read an Avro file.

Run it yourself: the data generator and every command above live in local-analytics/what-is-an-avro-file in the ClickHouse examples repo.

Prefer Python? → Read an Avro file in Python.
