How to read an Avro file in Python

Al Brown
Last updated: Jun 15, 2026

To read an Avro file in Python, use chDB to load it into a DataFrame and work with it using the pandas API you already know:

from chdb.datastore import DataStore
df = DataStore.from_file("data/events.avro", format="Avro")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step. Install chDB with pip:

pip install chdb

Read an Avro file into a DataFrame

DataStore.from_file reads the schema from the Avro file header and infers every column's type automatically. Point it at a path and print the result:

from chdb.datastore import DataStore

df = DataStore.from_file("data/events.avro", format="Avro")
print(df)
print(df.dtypes)

   event_id country event_type  amount           user
0         0      GB   purchase   50.98    [100, gold]
1         1      AU       view   30.73  [101, silver]
2         2      IN   purchase   81.54  [102, bronze]
3         3      GB       view   28.92    [103, gold]
4         4      AU   purchase    4.15  [104, silver]
event_id        int64
country           str
event_type        str
amount        float64
user           object
dtype: object

chDB read the types from the Avro schema embedded in the file header: event_id is an integer, amount is a float, and the nested user record (an Avro record-within-a-record) lands as a struct-valued column. You did not declare a single type.

df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.

Filter and aggregate the way you already do

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

from chdb.datastore import DataStore

df = DataStore.from_file("data/events.avro", format="Avro")
revenue = df[df["event_type"] == "purchase"].groupby("country")["amount"].sum()
print(revenue)

country
AU     4.15
GB    50.98
IN    81.54
Name: amount, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than materializing the full file first. Joins, computed columns with assign, the .str and .dt accessors, and merge across files all work the same way.

Hand off to real pandas when you need it

When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

from chdb.datastore import DataStore

df = DataStore.from_file("data/events.avro", format="Avro")
pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the DataStore object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than a fastavro reader loop?

On anything sizeable, yes. The standard Python way to read Avro is fastavro. For a filter-and-aggregate, you open the file, iterate every record, reach into nested keys by hand, and accumulate into a dict. Every record is decoded into a Python dict before you can touch it, and every filter and group key is code you write and maintain.
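As a rough sketch of what that manual loop looks like, the code below computes the same revenue-by-country aggregation by hand. The function names are illustrative, not taken from the companion repo, and the field names assume the sample file shown earlier. The fastavro import sits inside the file-reading helper so the aggregation itself can be tried on any iterable of dicts.

```python
from collections import defaultdict


def revenue_by_country(records):
    """Sum `amount` per `country` over purchase events.

    `records` is any iterable of dicts, which is what fastavro.reader yields.
    """
    totals = defaultdict(float)
    for record in records:
        # Every filter and group key is hand-written Python.
        if record["event_type"] == "purchase":
            totals[record["country"]] += record["amount"]
    return dict(totals)


def revenue_from_avro(path):
    """Stream an Avro file through the loop above (requires fastavro)."""
    import fastavro  # imported lazily; only this helper depends on it

    with open(path, "rb") as fo:
        # fastavro.reader decodes each record into a Python dict.
        return revenue_by_country(fastavro.reader(fo))
```

Calling revenue_from_avro("data/events.avro") would run the loop over the sample file. Grouping on a nested field would additionally require indexing into record["user"] by hand.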

chDB decodes in compiled ClickHouse code and never builds a Python object per row. Here is the same aggregation timed on a 3M-row Avro file (~19 MB compressed), on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14, best-of-3, warm:

fastavro (manual loop):         1.642s
import chdb.datastore as pd:    0.288s
speedup:                        5.7x

About 5.7x here. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow, because the fastavro loop must decode every record into a Python dict before it can filter anything, while chDB streams and aggregates in one pass.
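To reproduce a "best-of-3, warm" measurement on your own machine, a minimal timing helper could look like the sketch below. It is not the harness used for the numbers above; best_of and speedup are hypothetical names, and the single warm-up call is an assumption about what "warm" means here.

```python
import time


def best_of(fn, repeats=3, warmup=1):
    """Return the fastest wall-clock time of fn() over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warm the file cache; this run is not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of the two timings, e.g. fastavro time over chDB time."""
    return baseline_seconds / candidate_seconds
```

Pass each reader as a zero-argument callable and run the two measurements back to back, so that both see the same cache state and system load.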

The small case can cut the other way: on a tiny Avro file (a few thousand rows) fastavro is competitive, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large.

The codec note

Avro compresses its data blocks, and ClickHouse writes Avro with snappy by default. fastavro can read snappy only if you also pip install cramjam; without it, the read fails with the error "snappy codec is supported but you need to install one of the following libraries: ('cramjam',)". Writing with the deflate codec sidesteps that, because deflate decodes with the Python standard library. chDB reads snappy, deflate, and uncompressed Avro out of the box; the codec only matters for fastavro.

Works in Jupyter

Printing a DataStore object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample Avro files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-avro

New to the format? See "What is an Avro file". Working with other formats? "Read a Parquet file in Python" and "Read a JSON file in Python" use the same import chdb.datastore as pd pattern; only the reader changes.
