To read an ORC file in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

    import chdb.datastore as pd
    df = pd.read_orc("data/events.orc")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

Install chDB with pip:

    pip install chdb

Read an ORC file into a DataFrame
pd.read_orc works the way you expect. Point it at a path and print the result:

    import chdb.datastore as pd

    df = pd.read_orc("data/events.orc")
    print(df.head(8))
    print(df.dtypes)

Output:

       id country event_type  quantity  amount          event_time
    0   0      GB       cart         8  350.98 2026-01-11 05:46:33
    1   1      IN       view         6  830.73 2026-02-26 12:11:39
    2   2      DE       cart        10  981.54 2026-02-20 16:56:32
    3   3      DE       view         6  528.92 2026-03-05 15:51:08
    4   4      US       cart         5  804.15 2026-03-24 22:19:19
    5   5      GB       view         9  363.26 2026-02-27 02:31:17
    6   6      IN   purchase        10   23.42 2026-01-25 09:42:53
    7   7      US       view        10  747.28 2026-03-16 07:20:10
    id                     int64
    country               object
    event_type            object
    quantity               int16
    amount               float64
    event_time    datetime64[ns]
    dtype: object

ORC is a self-describing columnar format: every file carries a typed schema in its footer. chDB reads that footer, so id comes back as an integer, amount as a float, and event_time as a datetime, with no type declarations from you.
One thing is different from pandas under the hood: df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.
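To make "records what you asked for and runs it when you materialize" concrete, here is a toy sketch in plain Python. LazyRows, scan_events, and the scans counter are hypothetical names invented for this illustration. None of this is chDB's implementation, and it shows only the deferral, not the column pruning.

```python
# Toy illustration of deferred execution. LazyRows is a hypothetical class
# written for this example; it is not part of chDB.

class LazyRows:
    """Records filter/select steps and runs them only when materialized."""

    def __init__(self, source, steps=()):
        self._source = source        # callable returning dict rows (stands in for a file scan)
        self._steps = tuple(steps)   # recorded operations, not yet executed

    def filter(self, predicate):
        # Record the step and return a new lazy object; no data is read here.
        return LazyRows(self._source, self._steps + (("filter", predicate),))

    def select(self, *columns):
        return LazyRows(self._source, self._steps + (("select", columns),))

    def to_list(self):
        """Materialize: scan the source once, then apply every recorded step in order."""
        rows = self._source()
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [row for row in rows if arg(row)]
            else:
                rows = [{col: row[col] for col in arg} for row in rows]
        return rows


scans = []  # counts how often the "file" is actually read

def scan_events():
    scans.append("scan")
    return [
        {"id": 0, "country": "GB", "event_type": "cart", "amount": 350.98},
        {"id": 1, "country": "IN", "event_type": "view", "amount": 830.73},
        {"id": 6, "country": "IN", "event_type": "purchase", "amount": 23.42},
    ]

lazy = (LazyRows(scan_events)
        .filter(lambda row: row["event_type"] == "purchase")
        .select("country", "amount"))

scans_before = len(scans)   # building the chain has not touched the data
result = lazy.to_list()     # the scan happens here, once
print(scans_before, len(scans), result)
```

Building the chain costs nothing; the single scan happens at to_list(), which plays the role that printing, len(), or .to_pandas() play for the chDB object.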
Filter and aggregate the way you already do
The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

    import chdb.datastore as pd
    import pandas

    pandas.set_option("display.float_format", "{:.2f}".format)  # full decimals, not 1.00e+08

    df = pd.read_orc("data/events.orc")
    revenue = (df[df["event_type"] == "purchase"]
               .groupby("country")["amount"].sum()
               .sort_values(ascending=False))
    print(revenue)

Output:

    country
    GB   100213415.44
    US   100120032.40
    AU   100062085.37
    DE    99909346.62
    IN    99687151.84
    Name: amount, dtype: float64

Same syntax as pandas, same result. The difference is that chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than decoding the full file first. Joins, computed columns with assign, the .str and .dt accessors, and merge across files all work the same way.
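For intuition about the "one query" idea, here is a hand-written ClickHouse SQL equivalent of the chain above. It is a sketch, not a dump of what chDB generates internally. revenue_query and run_with_chdb are hypothetical helper names, and the sketch assumes chdb.query accepts "DataFrame" as its output format and that the relative path resolves from your working directory.

```python
# Sketch only: the SQL is a hand-written equivalent of the pandas chain,
# not the query chDB actually produces.

def revenue_query(path: str) -> str:
    """Build a ClickHouse query mirroring filter -> groupby -> sum -> sort."""
    # Escape backslashes and single quotes so the path is a valid SQL string literal.
    escaped = path.replace("\\", "\\\\").replace("'", "\\'")
    return (
        "SELECT country, sum(amount) AS revenue "
        f"FROM file('{escaped}', 'ORC') "
        "WHERE event_type = 'purchase' "
        "GROUP BY country "
        "ORDER BY revenue DESC"
    )


def run_with_chdb(sql: str):
    """Execute the query with chDB (assumed API: chdb.query(sql, "DataFrame"))."""
    import chdb  # third-party; imported lazily so the builder works without it installed
    return chdb.query(sql, "DataFrame")


sql = revenue_query("data/events.orc")
print(sql)
# With chDB installed and the sample file in place:
# print(run_with_chdb(sql))
```

The query names only country, event_type, and amount, which is why a columnar reader never needs to decode id, quantity, or event_time for this result.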
Hand off to real pandas when you need it
When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

    df = pd.read_orc("data/events.orc")
    pdf = df.to_pandas()  # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.
Is it faster than pandas.read_orc?
On a file big enough that scanning is the actual cost, yes. ORC is a columnar binary format, so chDB reads only the columns a query touches and does the aggregation in compiled, multi-threaded ClickHouse code. Timings for the aggregation above on a 3M-row (~25 MB) ORC file, measured on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14, best-of-3, warm:

    import pandas as pd:          0.146s
    import chdb.datastore as pd:  0.076s
    speedup:                      1.9x

About 1.9x here, from the same code with one import changed. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow and as the fraction of columns you actually need shrinks, because chDB reads only what the filter and aggregation touch while pandas.read_orc decodes every column into memory first.
The small case can cut the other way: on a tiny ORC file (a few thousand rows) plain pandas is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large, or when you want one import to cover both.
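If you want to check both the large-file and the tiny-file case on your own data, a best-of-N, warm measurement can be sketched with the standard library alone. best_of and speedup are hypothetical helper names, and the two workloads are stand-ins for the pandas and chDB versions of the aggregation. This is not the harness behind the numbers above.

```python
import time

def best_of(fn, repeats=3, warmup=1):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn()."""
    for _ in range(warmup):
        fn()  # warm-up run: fills caches, result discarded
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Ratio of baseline time to candidate time; above 1.0 means the candidate is faster."""
    return baseline_seconds / candidate_seconds


# Stand-in workloads; replace them with functions that run the same
# aggregation through pandas and through chdb.datastore.
def baseline():
    return sum(i * i for i in range(200_000))

def candidate():
    return sum(i * i for i in range(20_000))


t_baseline = best_of(baseline)
t_candidate = best_of(candidate)
print(f"baseline:  {t_baseline:.3f}s")
print(f"candidate: {t_candidate:.3f}s")
print(f"speedup:   {speedup(t_baseline, t_candidate):.1f}x")
```

Taking the minimum rather than the mean filters out runs disturbed by other load on the machine, which is why the ratio tends to be more stable than either absolute time.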
Works in Jupyter
Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.
Run it yourself
The complete, runnable example is here, with generate.sh to create the sample ORC file plus run.ipynb and a run.py mirror containing the exact code above:
github.com/ClickHouse/examples/tree/main/local-analytics/chdb-orc
Working with other formats? The guides "Read a Parquet file in Python" and "Read a JSON file in Python" use the same drop-in import chdb.datastore as pd pattern; only the reader changes. If you keep re-reading the same ORC file and want faster subsequent reads, convert it to Parquet once and take advantage of broader tool support.
The same code scales from an ORC file on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.