How to read a Parquet file in Python (faster than pandas)

Al Brown
Last updated: Jun 8, 2026

To read a Parquet file in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

import chdb.datastore as pd
df = pd.read_parquet("data/events.parquet")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

pip install chdb

Prefer the command line? See how to query a Parquet file to do this with clickhouse local.

Read a Parquet file into a DataFrame

pd.read_parquet reads the Parquet metadata and infers each column's type from the file's embedded schema. Point it at a path and print the result:

import chdb.datastore as pd

df = pd.read_parquet("data/events.parquet")
print(df)
print(df.dtypes)

                id country  ... amount                event_time
0                0      GB  ... 350.98 2026-01-11 05:46:33+00:00
1                1      IN  ... 830.73 2026-02-26 12:11:39+00:00
2                2      DE  ... 981.54 2026-02-20 16:56:32+00:00
3                3      DE  ... 528.92 2026-03-05 15:51:08+00:00
4                4      US  ... 804.15 2026-03-24 22:19:19+00:00
...            ...     ...  ...    ...                       ...

[20000000 rows x 6 columns]
id                         uint64
country                       str
event_type                    str
quantity                   uint16
amount                    float64
event_time    datetime64[ns, UTC]
dtype: object

The types came from the Parquet file itself, not from a dtype map you provided. event_time is a timezone-aware datetime; id is unsigned; Parquet stored the schema so chDB reads it without guessing.

One thing is different from pandas under the hood: df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only reads the columns and rows you actually use.

Filter and aggregate the way you already do

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

import chdb.datastore as pd
import pandas

pandas.set_option("display.float_format", "{:.2f}".format)  # full decimals, not 6.67e+08

df = pd.read_parquet("data/events.parquet")
revenue = (df[df["event_type"] == "purchase"]
           .groupby("country")["amount"].sum()
           .sort_values(ascending=False))
print(revenue)

country
IN    667411452.65
US    667068922.60
AU    666980194.83
GB    666350260.61
DE    666284177.10
Name: amount, dtype: float64

Same syntax as pandas, same result. The difference is that chDB compiles the whole chain into one optimized query and only reads the columns the aggregation touches, rather than decoding the full file first. Joins, computed columns with assign, the .str and .dt accessors, and merge across files all work the same way.

Hand off to real pandas when you need it

When a library downstream needs an actual pandas DataFrame (scikit-learn, plotting, anything that mutates in place), call .to_pandas() to materialize one:

import chdb.datastore as pd
df = pd.read_parquet("data/events.parquet")
pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than pandas.read_parquet?

On a file big enough that scanning is the actual cost, yes. Parquet is a columnar format, so chDB can skip columns the aggregation never touches. The timings below are for the aggregation above on a 20M-row (~260 MB) Parquet file, run on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14; each figure is the best of 3 warm runs:

import pandas as pd:            0.357s
import chdb.datastore as pd:    0.115s
speedup:                        3.1x

About 3x here, from the same code with one import changed. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow, because plain pandas decodes the entire file into memory before it can filter or group anything, while chDB streams and aggregates in one pass using only the columns it needs.

The small case can cut the other way: on a tiny Parquet file (a few thousand rows) plain pandas is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large, or when you want one import to cover both.
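To check both the large and the small case on your own files, a harness along these lines should be enough. It is a sketch, not the script behind the numbers above: best_of and revenue_by_country are hypothetical helper names, and the module under test is passed in as an argument so the same function runs against pandas or chdb.datastore.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 3, warmup: int = 1) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls to fn."""
    for _ in range(warmup):
        fn()  # untimed call so the timed runs start from a warm cache
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def revenue_by_country(pd, path: str):
    """Run the article's aggregation using whichever module is bound to `pd`."""
    df = pd.read_parquet(path)
    result = (df[df["event_type"] == "purchase"]
              .groupby("country")["amount"].sum()
              .sort_values(ascending=False))
    # len() materializes a lazy chDB result; on a pandas Series it is cheap.
    len(result)
    return result


# Usage (assumes data/events.parquet exists and both libraries are installed):
#   import pandas
#   import chdb.datastore
#   t_pandas = best_of(lambda: revenue_by_country(pandas, "data/events.parquet"))
#   t_chdb = best_of(lambda: revenue_by_country(chdb.datastore, "data/events.parquet"))
#   print(f"speedup: {t_pandas / t_chdb:.1f}x")
```

Taking the minimum rather than the mean reduces the influence of concurrent load, which is why the ratio between the two runs is more stable than either absolute time.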

Works in Jupyter

Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample Parquet file plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-parquet

Working with other formats? The guides on reading a CSV file in Python and reading a JSON file in Python use the same drop-in pattern, import chdb.datastore as pd; only the reader changes. If you have CSV data you read repeatedly, convert it to Parquet once and every later read gets faster.

The same code scales from a Parquet file on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.
