How to read an Arrow file in Python (faster than pandas)

Al Brown
Last updated: Jun 15, 2026

To read an Arrow file in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

import chdb.datastore as pd
df = pd.read_feather("data/events.arrow")   # Arrow IPC files are read by the feather reader

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

Install chDB with pip:

pip install chdb

Read an Arrow file into a DataFrame

Arrow IPC files (.arrow) share an on-disk layout with the Feather format, so read_feather is the right reader for both. Pass the file path and print the result:

import chdb.datastore as pd

df = pd.read_feather("data/events.arrow")
print(df.head(8))
print(df.dtypes)

   id country event_type  quantity  amount  event_time
0   0      GB       cart         8  350.98  1768110393
1   1      IN       view         6  830.73  1772107899
2   2      DE       cart        10  981.54  1771606592
3   3      DE       view         6  528.92  1772725868
4   4      US       cart         5  804.15  1774390759
5   5      GB       view         9  363.26  1772159477
6   6      IN   purchase        10   23.42  1769334173
7   7      US       view        10  747.28  1773645610
id             uint64
country        object
event_type     object
quantity       uint16
amount        float64
event_time     uint32
dtype: object

chDB reads the column names and types from the Arrow schema. id comes back as uint64, amount as float64, and event_time as uint32. The types are already embedded in the file, so there is no guessing or declaration step.
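Because the format is self-describing, a file can be sanity-checked before it is handed to read_feather. The sketch below relies on one fact from the Arrow specification: an Arrow IPC file begins and ends with the magic bytes ARROW1. The helper name looks_like_arrow_ipc_file is hypothetical and is not part of chDB or pyarrow. It checks only the markers, not the schema or the data.

```python
import os
from pathlib import Path

ARROW_MAGIC = b"ARROW1"  # marker written at both ends of an Arrow IPC file


def looks_like_arrow_ipc_file(path):
    """Return True if the file starts and ends with the Arrow IPC magic bytes.

    Hypothetical helper for illustration; it does not parse the schema.
    """
    data_path = Path(path)
    # A file too small to hold both markers cannot be an Arrow IPC file.
    if data_path.stat().st_size < 2 * len(ARROW_MAGIC):
        return False
    with data_path.open("rb") as f:
        head = f.read(len(ARROW_MAGIC))
        f.seek(-len(ARROW_MAGIC), os.SEEK_END)  # jump to the trailing marker
        tail = f.read(len(ARROW_MAGIC))
    return head == ARROW_MAGIC and tail == ARROW_MAGIC
```

Calling looks_like_arrow_ipc_file("data/events.arrow") before reading gives an early, cheap failure for a mislabeled file, such as a CSV saved with an .arrow extension.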

df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result. You trigger that by calling .head(), taking len(), or calling .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.
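The record-then-run idea can be sketched in plain Python. This is a toy model, not chDB's implementation: the LazyRows class and its filter, select, and to_list methods are hypothetical names, and the rows are copied from the sample output above. It shows only that building the chain touches no data and that the work happens at the materialization call.

```python
def _project(rows, columns):
    """Yield each row reduced to the requested columns."""
    for row in rows:
        yield {c: row[c] for c in columns}


class LazyRows:
    """Toy stand-in for a lazy frame: records steps, runs them on demand."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source data (a list of dicts here)
        self._steps = tuple(steps)   # recorded operations, not yet executed

    def filter(self, predicate):
        # Recording a step returns a new lazy object; no rows are touched.
        return LazyRows(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyRows(self._rows, self._steps + (("select", columns),))

    def to_list(self):
        # Materialization point: only now do the recorded steps execute.
        rows = iter(self._rows)
        for kind, arg in self._steps:
            rows = filter(arg, rows) if kind == "filter" else _project(rows, arg)
        return list(rows)


events = [
    {"id": 0, "country": "GB", "event_type": "cart"},
    {"id": 1, "country": "IN", "event_type": "view"},
    {"id": 6, "country": "IN", "event_type": "purchase"},
]

calls = []  # ids of the rows the predicate has inspected so far


def is_purchase(row):
    calls.append(row["id"])
    return row["event_type"] == "purchase"


lazy = LazyRows(events).filter(is_purchase).select("id", "country")
ran_before_materialize = list(calls)   # still empty: nothing has run
result = lazy.to_list()                # the filter and projection run here
```

In the real library the recorded chain is handed to ClickHouse's engine rather than looped over in Python, but the calling pattern is the same: build the chain first, then materialize with .head(), len(), or .to_pandas().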

Filter and aggregate the way you already do

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

import chdb.datastore as pd

df = pd.read_feather("data/events.arrow")
counts = (df[df["event_type"] == "purchase"]
          .groupby("country")["id"].count()
          .sort_values(ascending=False))
print(counts)

country
GB    200304
DE    200026
AU    199997
US    199997
IN    199631
Name: id, dtype: int64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than materializing the full file first. Joins, computed columns with assign, the .str and .dt accessors, and merge across files all work the same way.

Hand off to real pandas when you need it

When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

df = pd.read_feather("data/events.arrow")
pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than pyarrow + pandas?

Arrow plays to pyarrow's strengths. It is pyarrow's native on-disk format, so reading it is fast. chDB still edges ahead because it only touches the columns and rows the query needs, while the pyarrow path decodes the whole table into a pandas frame before grouping. The timings below are for the same filter and aggregate as above on a 3M-row (~69 MB) Arrow file, best-of-3 with a warm page cache, on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14:

pyarrow + pandas:               0.052s
import chdb.datastore as pd:    0.036s
speedup:                        1.5x

About 1.5x here, from the same code with one import changed. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow, because chDB never has to build the full frame just to answer a grouped query.
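A best-of-N, back-to-back comparison of this kind can be reproduced with a small harness. The sketch below is an assumption about how such a measurement might be set up, not the script behind the numbers above. The names best_of and speedup are hypothetical, and the two workloads are placeholders to be swapped for the pyarrow + pandas path and the chDB path. Taking the minimum of several runs means the first, cold-cache run does not set the reported time.

```python
import time


def best_of(fn, repeats=3):
    """Run fn `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)  # the minimum is the run least disturbed by other load


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Placeholder workloads; replace them with the two read-filter-aggregate paths.
def baseline_workload():
    sum(i * i for i in range(200_000))


def candidate_workload():
    sum(i * i for i in range(100_000))


baseline = best_of(baseline_workload)
candidate = best_of(candidate_workload)
print(f"baseline:  {baseline:.3f}s")
print(f"candidate: {candidate:.3f}s")
print(f"speedup:   {speedup(baseline, candidate):.1f}x")
```

Running both paths in the same process, one after the other, keeps cache state and machine load comparable, which is why the ratio is steadier than either absolute time.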

The small case can cut the other way: on a tiny Arrow file (a few thousand rows) pyarrow is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data to scan. Reach for chDB when the file is large, or when you want a single import that handles small and large files alike.

Works in Jupyter

Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample Arrow file plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-arrow

Working with other formats? The guides "Read a Parquet file in Python" and "Read a CSV file in Python" use the same drop-in import chdb.datastore as pd pattern; only the reader changes. The same code scales from a file on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.
