To read an Avro file in Python, use chDB to load it into a DataFrame and work with it using the pandas API you already know:
    from chdb.datastore import DataStore
    df = DataStore.from_file("data/events.avro", format="Avro")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.
    pip install chdb

Read an Avro file into a DataFrame
DataStore.from_file reads the schema from the Avro file header and infers every column's type automatically. Point it at a path and print the result:
    from chdb.datastore import DataStore

    df = DataStore.from_file("data/events.avro", format="Avro")
    print(df)
    print(df.dtypes)

       event_id country event_type  amount           user
    0         0      GB   purchase   50.98    [100, gold]
    1         1      AU       view   30.73  [101, silver]
    2         2      IN   purchase   81.54  [102, bronze]
    3         3      GB       view   28.92    [103, gold]
    4         4      AU   purchase    4.15  [104, silver]
    event_id        int64
    country           str
    event_type        str
    amount        float64
    user           object
    dtype: object

chDB read the types from the Avro schema embedded in the file header: event_id is an integer, amount is a float, and the nested user record (an Avro record-within-a-record) lands as a struct-valued column. You did not declare a single type.
df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.
Filter and aggregate the way you already do
The pandas you write does not change. Filter with a boolean mask, group, and aggregate:
    from chdb.datastore import DataStore

    df = DataStore.from_file("data/events.avro", format="Avro")
    revenue = df[df["event_type"] == "purchase"].groupby("country")["amount"].sum()
    print(revenue)

    country
    AU     4.15
    GB    50.98
    IN    81.54
    Name: amount, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than materializing the full file first. Joins, computed columns with assign, the .str and .dt accessors, and merge across files all work the same way.
Hand off to real pandas when you need it
When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:
    from chdb.datastore import DataStore

    df = DataStore.from_file("data/events.avro", format="Avro")
    pdf = df.to_pandas()  # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the DataStore object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.
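The hand-off pattern described above could look roughly like the sketch below. The function name purchases_by_country is hypothetical, and the sketch assumes that selecting a list of columns on a DataStore behaves as it does in pandas; only DataStore.from_file, boolean-mask filtering, and .to_pandas() are taken from this article.

```python
def purchases_by_country(path):
    """Filter on the DataStore, then aggregate the small result in pandas."""
    # Imported inside the function so that defining it does not require chDB.
    from chdb.datastore import DataStore  # third-party: pip install chdb

    ds = DataStore.from_file(path, format="Avro")

    # Heavy step: the row filter and column selection stay on the lazy object.
    # Assumption: list-of-columns indexing works here as it does in pandas.
    purchases = ds[ds["event_type"] == "purchase"][["country", "amount"]]

    # Hand-off: materialize only the filtered rows as a real pandas.DataFrame.
    pdf = purchases.to_pandas()

    # From here on this is ordinary pandas.
    return pdf.groupby("country")["amount"].sum().sort_values(ascending=False)
```

Calling purchases_by_country("data/events.avro") would return a regular pandas Series that can be passed to plotting or modelling code.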
Is it faster than a fastavro reader loop?
On anything sizeable, yes. The standard Python way to read Avro is fastavro. For a filter-and-aggregate, you open the file, iterate every record, reach into nested keys by hand, and accumulate into a dict. Every record is decoded into a Python dict before you can touch it, and every filter and group key is code you write and maintain.
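For comparison, a manual loop of the kind described above might be sketched as follows. The function names are hypothetical and this is not the benchmark code from the companion folder; fastavro.reader is the real fastavro call, and the sample records reuse the rows printed earlier so the aggregation can be checked without an Avro file.

```python
from collections import defaultdict


def revenue_by_country(records):
    """Sum purchase amounts per country from an iterable of record dicts."""
    totals = defaultdict(float)
    for rec in records:
        # Every record is already a Python dict at this point; a nested Avro
        # record would arrive as a nested dict that you index by hand.
        if rec["event_type"] == "purchase":
            totals[rec["country"]] += rec["amount"]
    return dict(totals)


def revenue_from_avro(path):
    """Run the same aggregation over an Avro file read with fastavro."""
    import fastavro  # third-party: pip install fastavro

    with open(path, "rb") as fo:
        # fastavro.reader yields one dict per record in the file.
        return revenue_by_country(fastavro.reader(fo))


# The five sample rows shown earlier, written out as plain dicts.
sample = [
    {"event_id": 0, "country": "GB", "event_type": "purchase", "amount": 50.98},
    {"event_id": 1, "country": "AU", "event_type": "view", "amount": 30.73},
    {"event_id": 2, "country": "IN", "event_type": "purchase", "amount": 81.54},
    {"event_id": 3, "country": "GB", "event_type": "view", "amount": 28.92},
    {"event_id": 4, "country": "AU", "event_type": "purchase", "amount": 4.15},
]
print(revenue_by_country(sample))
```

The filter condition and the group key are both hand-written here, which is the maintenance cost the paragraph above refers to.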
chDB decodes in compiled ClickHouse code and never builds a Python object per row. Timings for the same aggregation on a 3M-row Avro file (~19 MB compressed), measured on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14, best of 3 warm runs:

    fastavro (manual loop): 1.642s
    import chdb.datastore as pd: 0.288s
    speedup: 5.7x

About 5.7x here. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow, because the fastavro loop must decode every record into a Python dict before it can filter anything, while chDB streams and aggregates in one pass.
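A best-of-3, warm measurement like the one reported here can be reproduced with a small harness along these lines. This is a minimal sketch using only the standard library; best_of and speedup are hypothetical helper names, not part of the companion code.

```python
import time


def best_of(fn, repeats=3, warmup=1):
    """Return the fastest wall-clock time of fn() over `repeats` timed runs."""
    for _ in range(warmup):
        fn()  # untimed run so caches are warm before measuring
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline time to candidate time (higher favors the candidate)."""
    return baseline_seconds / candidate_seconds


# Ratio of the two timings reported in this section.
print(f"{speedup(1.642, 0.288):.1f}x")  # prints 5.7x
```

To compare the two readers, pass each aggregation as a zero-argument callable to best_of and feed the two results to speedup, running them back to back on the same machine.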
The small case can cut the other way: on a tiny Avro file (a few thousand rows) fastavro is competitive, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large.
The codec note
Avro compresses its data blocks. ClickHouse writes Avro with snappy by default. fastavro can read snappy only if you also pip install cramjam; without it, reading fails with the error "snappy codec is supported but you need to install one of the following libraries: ('cramjam',)". Writing with the deflate codec sidesteps that, because deflate decodes with the Python standard library. chDB reads snappy, deflate, and uncompressed Avro out of the box; the codec only matters for fastavro.
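To check which codec a file uses before choosing a reader, you can parse the container header yourself. The sketch below relies on the Avro object container layout: the magic bytes Obj\x01, then a metadata map holding avro.schema and, optionally, avro.codec, where a missing codec means "null" (uncompressed). The function names are hypothetical and the demo header is built in memory rather than read from a real file.

```python
import io
import json


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded variable-length integer)."""
    raw, shift = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)


def _read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_metadata(stream):
    """Return the header metadata map of an Avro object container file."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count is followed by the block's byte size
            _read_long(stream)
            count = -count
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            meta[key] = _read_bytes(stream)
    return meta


def avro_codec(stream):
    """Return the codec name; a missing avro.codec entry means 'null'."""
    return read_avro_metadata(stream).get("avro.codec", b"null").decode("ascii")


# Demo: a hand-built header with two metadata entries and a 16-byte sync marker.
demo_header = (
    b"Obj\x01" + b"\x04"
    + b"\x16avro.schema" + b'\x10"string"'
    + b"\x14avro.codec" + b"\x0csnappy"
    + b"\x00" + bytes(16)
)
print(avro_codec(io.BytesIO(demo_header)))  # prints snappy
print(json.loads(read_avro_metadata(io.BytesIO(demo_header))["avro.schema"]))
```

For a file on disk, open it with open(path, "rb") and pass the file object to avro_codec; a result of "snappy" tells you fastavro will need cramjam installed.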
Works in Jupyter
Printing a DataStore object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.
Run it yourself
The complete, runnable example is here, with generate.sh to create the sample Avro files plus run.ipynb and a run.py mirror containing the exact code above:
github.com/ClickHouse/examples/tree/main/local-analytics/chdb-avro
New to the format? See what is an Avro file. Working with other formats? Read a Parquet file in Python and read a JSON file in Python use the same import chdb.datastore as pd pattern; only the reader changes.