How to read a Feather file in Python (faster than pandas)

Al Brown
Last updated: Jun 15, 2026

To read a Feather file in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

import chdb.datastore as pd
df = pd.read_feather("events.feather")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

pip install chdb

Read a Feather file into a DataFrame

pd.read_feather works the way you expect. Point it at a path and print the result:

import chdb.datastore as pd

df = pd.read_feather("events.feather")
print(df.head(8))
print(df.dtypes)

   id country event_type  quantity  amount  event_time
0   0      GB       cart         8  350.98  1768110393
1   1      IN       view         6  830.73  1772107899
2   2      DE       cart        10  981.54  1771606592
3   3      DE       view         6  528.92  1772725868
4   4      US       cart         5  804.15  1774390759
5   5      GB       view         9  363.26  1772159477
6   6      IN   purchase        10   23.42  1769334173
7   7      US       view        10  747.28  1773645610
id             uint64
country        object
event_type     object
quantity        uint8
amount        float64
event_time     uint32
dtype: object

The schema comes from the Arrow footer in the file, so column names and types are read directly. Feather is the on-disk name for Arrow IPC, and chDB reads it with ClickHouse's Arrow reader.

One thing is different from pandas under the hood: df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.

Note that event_time came back as uint32. This file was written from a ClickHouse DateTime, which serializes to Arrow as an unsigned 32-bit count of seconds since the Unix epoch. If you need real datetime objects, materialize first with pdf = df.to_pandas(), then convert with pandas.to_datetime(pdf["event_time"], unit="s"). Because pd is bound to chdb.datastore here, that call needs a plain import pandas. A file written from a pandas datetime column carries an Arrow timestamp type and reads back as a timestamp directly.
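The conversion step can be sketched in plain pandas. This is an illustration rather than the companion code: the small frame below stands in for the result of df.to_pandas(), and its two values are copied from the sample output above.

```python
import pandas

# Stand-in for the materialized frame (df.to_pandas() in the article).
pdf = pandas.DataFrame(
    {"event_time": pandas.Series([1768110393, 1772107899], dtype="uint32")}
)

# Interpret the integers as seconds since the Unix epoch (UTC, timezone-naive).
pdf["event_time"] = pandas.to_datetime(pdf["event_time"], unit="s")

print(pdf.dtypes)
print(pdf)
```

After the conversion the column has a datetime64 dtype, so the .dt accessor and time-based resampling work as usual.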

Filter and aggregate the way you already do

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

import chdb.datastore as pd

df = pd.read_feather("events.feather")
revenue = (df[df["event_type"] == "purchase"]
           .groupby("country")["amount"].sum()
           .sort_values(ascending=False))
print(revenue.to_pandas())

country
GB   100213415.44
US   100120032.40
AU   100062085.37
DE    99909346.62
IN    99687151.84
Name: amount, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than materializing the full file first. Joins, computed columns with assign, the .str and .dt accessors, and merge across files all work the same way.

Hand off to real pandas when you need it

When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

import chdb.datastore as pd

df = pd.read_feather("events.feather")
pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than pandas.read_feather?

Feather is already a columnar, memory-mappable format, so pandas reads it efficiently and the gap is narrower than it is for row-oriented text formats. The aggregation above was timed on a 3M-row (~68 MB) Feather file, best-of-3 with a warm page cache, on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14:

import pandas as pd:              0.051s
import chdb.datastore as pd:      0.033s
speedup:                          1.5x

About 1.5x here, from the same code with one import changed. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow, because chDB scans only the columns and rows the aggregation needs while pandas.read_feather reads the full file into a DataFrame before it can group anything.
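To reproduce a best-of-3 comparison on your own machine, a small timing helper is enough. The sketch below is an assumption about how such a measurement could be taken, not the benchmark script from the companion folder. The names best_of, speedup, and workload are hypothetical, and workload is a placeholder you would replace with the read-and-aggregate code for each import.

```python
import time


def best_of(fn, repeats=3):
    """Return the smallest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of the baseline time to the candidate time."""
    return baseline_seconds / candidate_seconds


def workload():
    # Placeholder so the sketch runs anywhere; swap in the aggregation
    # from this article, once per import being compared.
    sum(i * i for i in range(100_000))


elapsed = best_of(workload)
print(f"best of 3: {elapsed:.3f}s")

# Ratio computed from the two timings reported above.
print(f"speedup: {speedup(0.051, 0.033):.1f}x")
```

Taking the minimum rather than the mean discards runs disturbed by other processes, and the first call also warms the page cache for the ones that follow.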

The small case can cut the other way: on a tiny Feather file (a few thousand rows) plain pandas is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large, or when you want one import to cover both.

Works in Jupyter

Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample Feather file plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-feather

Working with other formats? The guide "Read an Arrow file in Python" uses the same reader, since Feather and Arrow IPC are the same format. "Read a Parquet file in Python" and "Read a CSV file in Python" use the same import chdb.datastore as pd pattern; only the reader changes.

The same code scales from a file on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.
