How to read a BSON file in Python

To read a BSON file in Python, use chDB. There is no pandas.read_bson, but chDB's DataStore reads it directly into a DataFrame-like object you work with using the pandas API:

from chdb.datastore import DataStore
df = DataStore.from_file("data/events.bson", format="BSONEachRow")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

Install chDB with pip:

pip install chdb

Read a BSON file into a DataFrame

Point DataStore.from_file at the path, tell it the format is BSONEachRow, and print the result:

from chdb.datastore import DataStore

df = DataStore.from_file("data/events.bson", format="BSONEachRow")
print(df)
print(df.dtypes)

   event_id country event_type  revenue
0         0      GB   purchase    37.41
1         1      AU       view    76.47
2         2      IN       cart    67.61
3         3      GB   purchase    87.69
4         4      AU       view    18.12
5         5      IN       cart    35.58
event_id        int64
country           str
event_type        str
revenue       float64
dtype: object

chDB infers the schema from the documents: event_id is an integer, revenue is a float. You declare nothing.

df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.
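To make the lazy behaviour concrete, here is a toy sketch in plain Python. It is not chDB's implementation or API: LazyRows, where, and materialize are hypothetical names invented for illustration. It only shows the general idea of recording steps first and reading the source when a result is requested.

```python
from typing import Callable, Iterable

Row = dict


class LazyRows:
    """Toy lazy pipeline (hypothetical, not chDB's API)."""

    def __init__(self, source: Callable[[], Iterable[Row]], filters: tuple = ()):
        self._source = source      # invoked only when results are needed
        self._filters = filters    # recorded steps, not yet applied

    def where(self, predicate: Callable[[Row], bool]) -> "LazyRows":
        # Record the filter and return a new object; no rows are read here.
        return LazyRows(self._source, self._filters + (predicate,))

    def materialize(self) -> list:
        # Only now is the source read and every recorded filter applied.
        rows = self._source()
        for predicate in self._filters:
            rows = filter(predicate, rows)
        return list(rows)


source_reads = []  # tracks how many times the source was actually read


def load_events():
    source_reads.append("read")
    return [
        {"event_type": "purchase", "revenue": 37.41},
        {"event_type": "view", "revenue": 76.47},
    ]


lazy = LazyRows(load_events).where(lambda row: row["event_type"] == "purchase")
reads_before = len(source_reads)   # building the chain read nothing
purchases = lazy.materialize()     # the source is read here, once
reads_after = len(source_reads)
```

A real engine goes further than this sketch, for example by skipping columns the query never touches, but the deferral is the same: nothing is read until a result is requested.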

BSON (Binary JSON) is the binary document format MongoDB stores internally. The .bson files you query offline almost always come from mongodump, which writes one .bson file per collection (mongoexport only emits JSON or CSV). Each file is a stream of length-prefixed documents, one after another.
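The length-prefixed layout is simple enough to walk with the standard library. The sketch below splits a stream into raw documents without decoding their fields; iter_raw_documents is a hypothetical helper written for this illustration, and it relies only on the BSON rule that each document starts with a little-endian int32 giving its total size and ends with a NUL byte.

```python
import io
import struct
from typing import BinaryIO, Iterator


def iter_raw_documents(stream: BinaryIO) -> Iterator[bytes]:
    """Yield each BSON document in the stream as undecoded bytes."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of file
        if len(prefix) < 4:
            raise ValueError("truncated length prefix")
        # The int32 length is little-endian and counts the whole document,
        # including these 4 bytes and the trailing NUL terminator.
        (size,) = struct.unpack("<i", prefix)
        if size < 5:
            raise ValueError(f"invalid document length {size}")
        body = stream.read(size - 4)
        if len(body) != size - 4 or body[-1] != 0:
            raise ValueError("truncated or unterminated document")
        yield prefix + body


# Two hand-built documents, back to back: {} and {"a": 1} (int32 value).
empty_doc = b"\x05\x00\x00\x00" + b"\x00"
one_field_doc = (
    b"\x0c\x00\x00\x00"    # total size: 12 bytes
    + b"\x10"              # element type: int32
    + b"a\x00"             # field name "a", NUL-terminated
    + b"\x01\x00\x00\x00"  # value 1
    + b"\x00"              # document terminator
)
documents = list(iter_raw_documents(io.BytesIO(empty_doc + one_field_doc)))
```

This is handy for counting documents or checking that a dump was not cut off mid-write; decoding the fields themselves is still a job for a BSON library or chDB.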

Filter and aggregate the way you already do

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

from chdb.datastore import DataStore

df = DataStore.from_file("data/events.bson", format="BSONEachRow")
purchases = df[df["event_type"] == "purchase"].groupby("country")["revenue"].sum()
print(purchases.to_pandas().round(2))

country
GB    125.1
Name: revenue, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than materializing the full file first. Joins, computed columns with assign, the .str and .dt accessors, and merge across files all work the same way.

Note that groupby(...)[col].sum() returns a ColumnExpr. Call .to_pandas() to get a real pandas Series with the group labels as the index.
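To picture what "one query" means for the chain above, the sketch below writes the same filter, group, and sum as a single SQL statement. Two assumptions apply: the SQL text is a guess at the general shape, since the article does not show the query chDB actually generates, and sqlite3 from the standard library stands in for ClickHouse so that the example runs anywhere.

```python
import sqlite3

# The six sample rows printed earlier in this article.
rows = [
    (0, "GB", "purchase", 37.41),
    (1, "AU", "view", 76.47),
    (2, "IN", "cart", 67.61),
    (3, "GB", "purchase", 87.69),
    (4, "AU", "view", 18.12),
    (5, "IN", "cart", 35.58),
]

conn = sqlite3.connect(":memory:")  # stand-in engine, not ClickHouse
conn.execute(
    "CREATE TABLE events "
    "(event_id INTEGER, country TEXT, event_type TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

# Filter, group, and sum expressed as one statement (assumed shape).
query = """
    SELECT country, SUM(revenue) AS revenue
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY country
    ORDER BY country
"""
totals = {country: round(revenue, 2) for country, revenue in conn.execute(query)}
conn.close()
print(totals)
```

Only country, event_type, and revenue appear in the statement, which is why an engine that plans the whole chain at once can skip event_id entirely.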

Hand off to real pandas when you need it

When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

from chdb.datastore import DataStore

df = DataStore.from_file("data/events.bson", format="BSONEachRow")
pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than decoding with pymongo?

The alternative Python path is the bson package that ships with pymongo. You open the file, decode every document into a Python dict, build a DataFrame, then group it:

import pandas as pd
from bson import decode_file_iter

with open("events.bson", "rb") as f:
    pdf = pd.DataFrame(decode_file_iter(f))

p = pdf[pdf.event_type == b"purchase"]
result = p.groupby("country")["revenue"].sum()

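That works, but decode_file_iter materializes every document into a Python dict before pandas sees it. The string fields in this sample file also arrive as bytes, hence the b"purchase" comparison: pymongo returns bytes for BSON binary values and str for values stored as the BSON string type, so check which one your export uses. chDB reads the same file in compiled code, with no Python-object layer in the hot path.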
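If you stay on the pymongo path, the b"purchase" comparison is easy to get wrong, so it can help to decode bytes values to str before building the DataFrame. The helper below is a minimal sketch: decode_bytes_fields is a hypothetical name, the list of dicts is a stand-in for what the decoder yields for this sample file, and it assumes every bytes value is UTF-8 text rather than genuinely binary data.

```python
from typing import Iterable, Iterator


def decode_bytes_fields(
    documents: Iterable[dict], encoding: str = "utf-8"
) -> Iterator[dict]:
    """Yield copies of each document with top-level bytes values decoded to str."""
    for doc in documents:
        yield {
            key: value.decode(encoding) if isinstance(value, bytes) else value
            for key, value in doc.items()
        }


# Stand-in for decoded documents from the sample file (an assumption).
decoded = [
    {"event_id": 0, "country": b"GB", "event_type": b"purchase", "revenue": 37.41},
    {"event_id": 1, "country": b"AU", "event_type": b"view", "revenue": 76.47},
]
normalized = list(decode_bytes_fields(decoded))
```

After this step the filter can compare against "purchase" as a plain string. The helper only looks at top-level fields, and it adds one more Python pass over every document, which is the kind of per-document cost the chDB path avoids.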

The aggregation above on a 2M-row (~149 MB) BSON file, best of 3 runs with a warm page cache, on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14:

pymongo decode + pandas:        2.865s
import chdb.datastore:          0.066s
speedup:                        43.5x

About 43x here. The ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap is large because per-document Python decoding is the dominant cost on the pymongo side. The chDB path also stays one read call whether the export is 1 MB or 100 GB.

The small case can cut the other way: on a tiny BSON file (a few thousand documents) pymongo decode is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large.
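To find where the crossover sits for your own files, a best-of-N timing harness is enough. The sketch below is an assumption about how such a measurement could be set up, not the script behind the numbers above; best_of and speedup are hypothetical helper names, and the two lambdas are placeholder workloads to swap for the pymongo and chDB versions of the query.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Placeholder workloads; replace with the two versions of the aggregation.
slow = best_of(lambda: sum(i * i for i in range(200_000)))
fast = best_of(lambda: sum(range(200_000)))
print(f"speedup: {speedup(slow, fast):.1f}x")
```

Taking the minimum rather than the mean reduces the effect of concurrent load, and running each workload once before timing it gives the warm-cache condition described above.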

Works in Jupyter

Printing a chDB DataStore renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample .bson files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-bson

Working with other formats? Read a JSON file in Python and read a Parquet file in Python use the same DataStore pattern; only the reader changes.

The same code scales from a .bson file on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.
