How to read a MessagePack file in Python

Al Brown
Last updated: Jun 15, 2026

To read a MessagePack file in Python, use chDB to load it into a DataFrame and work with it using the pandas API you already know, running on ClickHouse's engine. MsgPack carries no schema, so you pass the column list once via structure=:

from chdb.datastore import DataStore

SCHEMA = "event_id UInt64, country String, event_type String, amount Float64"
df = DataStore.from_file("data/events.msgpack", format="MsgPack", structure=SCHEMA)

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

pip install chdb

Read a MessagePack file into a DataFrame

DataStore.from_file takes the file path, the format name, and the column structure. Print the result and its types:

from chdb.datastore import DataStore

SCHEMA = "event_id UInt64, country String, event_type String, amount Float64"
df = DataStore.from_file("data/events.msgpack", format="MsgPack", structure=SCHEMA)
print(df)
print(df.dtypes)

   event_id country event_type  amount
0         0      GB   purchase   91.64
1         1      AU       view   90.14
2         2      IN   purchase   40.73
3         3      US       view   41.03
4         4      DE   purchase   61.25
event_id       uint64
country           str
event_type        str
amount        float64
dtype: object

The types match what you declared: event_id is an unsigned integer and amount is a float. df is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result: by printing it, calling len(), or calling .to_pandas().

The schema requirement: what happens if you forget structure=

MsgPack is different from Parquet, ORC, or Avro. Those formats embed column names and types in the file, so chDB can infer a schema automatically. MsgPack is a compact binary encoding of values with no header describing the columns. Leave structure= out and the read fails:

bad = DataStore.from_file("data/events.msgpack", format="MsgPack")
bad.to_pandas()

Data source not accessible: Query execution failed: Code: 636. DB::Exception:
The table structure cannot be extracted from a MsgPack format file. Error: ...
(CANNOT_EXTRACT_TABLE_STRUCTURE)

The fix is structure=: a comma-separated name Type list in the same order the columns were written. Get the order right and the read succeeds.
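Because the column order matters, it can help to keep the columns as an ordered list and derive the string from it. The sketch below is one way to do that. COLUMNS and build_structure are hypothetical names for illustration, not part of chDB; the only thing taken from this article is the shape of the string passed to structure=.

```python
# Hypothetical helper (not part of chDB): derive the structure string from
# an ordered list of (column name, ClickHouse type) pairs.
COLUMNS = [
    ("event_id", "UInt64"),
    ("country", "String"),
    ("event_type", "String"),
    ("amount", "Float64"),
]


def build_structure(columns):
    """Join (name, type) pairs into 'name Type, name Type, ...' in list order."""
    return ", ".join(f"{name} {ctype}" for name, ctype in columns)


SCHEMA = build_structure(COLUMNS)
print(SCHEMA)
```

Keeping the list as the single source of truth means a reordered or added column is changed in one place, and the same list can supply column names to other code.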

Filter and aggregate the way you already do

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

from chdb.datastore import DataStore

SCHEMA = "event_id UInt64, country String, event_type String, amount Float64"
df = DataStore.from_file("data/events.msgpack", format="MsgPack", structure=SCHEMA)

purchases = df[df["event_type"] == "purchase"].groupby("country")["amount"].sum()
print(purchases)

country
DE    61.25
GB    91.64
IN    40.73
Name: amount, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only processes the columns the aggregation touches. Joins, computed columns with .assign, and the .str and .dt accessors all work the same way.

Hand off to real pandas when you need it

When a library downstream needs an actual pandas DataFrame (plotting, scikit-learn, or anything that mutates in place), call .to_pandas() to materialize one:

pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than reading with the msgpack library?

The standard alternative is the msgpack library, which decodes the file into raw Python objects. ClickHouse writes MsgPack as a flat stream of values with no row boundaries and no field names, so Unpacker hands you one undifferentiated stream you have to reshape yourself:

import msgpack
from collections import defaultdict

# country -> [purchase count, purchase total]
agg = defaultdict(lambda: [0, 0.0])
with open("data/events.msgpack", "rb") as f:
    # The Unpacker yields one value at a time; a row is four consecutive values.
    vals = iter(msgpack.Unpacker(f, raw=False))
    while True:
        try:
            next(vals)                     # event_id (unused)
        except StopIteration:
            break                          # clean end of file
        country = next(vals)
        event_type = next(vals)
        amount = next(vals)
        if event_type == "purchase":
            a = agg[country]
            a[0] += 1
            a[1] += amount

You hard-code the column count, manage the loop yourself, and the grouping logic is yours to write and test. Both approaches need the column structure; the chDB version puts it in one place and the engine does the rest.
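If you do stay with a manual decode, the reshaping step can at least be separated from the aggregation. The following is a minimal sketch that groups a flat stream into rows of a fixed width. rows_from_flat and COLUMN_NAMES are hypothetical names, and a plain Python list stands in for the decoded value stream so the example runs without the msgpack library or the sample file.

```python
from itertools import islice

# Assumed column order; it must match the order the values were written in.
COLUMN_NAMES = ["event_id", "country", "event_type", "amount"]


def rows_from_flat(values, names):
    """Yield one dict per row from a flat iterable of column values."""
    it = iter(values)
    width = len(names)
    while True:
        chunk = list(islice(it, width))
        if not chunk:
            return  # clean end of stream
        if len(chunk) != width:
            # The stream ended part-way through a row.
            raise ValueError(
                f"truncated row: expected {width} values, got {len(chunk)}"
            )
        yield dict(zip(names, chunk))


# Stand-in for the decoded stream, using the first two sample rows.
flat = [0, "GB", "purchase", 91.64, 1, "AU", "view", 90.14]
rows = list(rows_from_flat(flat, COLUMN_NAMES))
print(rows[1]["country"])
```

Unlike the inline loop, this version raises on a row that is cut short instead of letting StopIteration escape mid-row, and the column count comes from the name list rather than from the number of next() calls.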

On timing, both approaches are bound primarily by the binary parsing pass through a 3M-row (~74 MB) file. The timings below are for the filter-and-sum above, best of 3 with a warm page cache, on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14:

msgpack library (manual):           0.274s
chdb.datastore (pandas, no SQL):    0.262s
speedup:                            1.0x

The two are roughly the same here: parsing a flat binary stream is the bottleneck and both implementations do it at comparable speed. chDB pulls ahead on more complex aggregations, where the compiled engine does more work in one pass. For a simple filter-and-sum over MsgPack, the main advantage is the code you write rather than the runtime: the chDB version is shorter, puts the column declaration in one place, and you can swap in any pandas operation without rewriting the decode loop.
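To repeat a comparison like this on your own data, a best-of-N timer is enough. The sketch below shows one common way to take such a measurement; it is an assumption about method, not the harness used for the numbers above. best_of and placeholder_workload are hypothetical names, and the placeholder stands in for either the msgpack loop or the chDB pipeline.

```python
import time


def best_of(fn, repeats=3):
    """Run fn `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def placeholder_workload():
    # Stand-in for the code being measured; swap in the real read-and-aggregate.
    return sum(i * i for i in range(100_000))


elapsed = best_of(placeholder_workload)
print(f"best of 3: {elapsed:.3f}s")
```

Taking the minimum rather than the mean reduces the influence of one-off interference such as a cold cache on the first run. When timing a lazy chDB object, make sure the function materializes the result (for example with .to_pandas()), otherwise only the query construction is measured.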

Works in Jupyter

Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample MsgPack files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-messagepack

Working with other formats? "Read a JSON file in Python" and "Read a Parquet file in Python" follow the same chdb.datastore pattern; those formats carry their own schema, so no structure= is needed.
