MessagePack (often written MsgPack) is a compact binary serialization format that encodes the same data model as JSON (objects, arrays, strings, numbers, booleans and null) in far fewer bytes. It is self-delimiting at the value level: each value starts with a one-byte tag that gives its type and, either directly or through a length field that follows, its size. To read a stream of MessagePack rows with clickhouse local, you give it the column types and query the file in place:

```
clickhouse local -q "
SELECT * FROM file('events.msgpack', MsgPack,
    'id UInt64, country String, device String, event_type String, revenue Float64, quantity UInt8')"
```

At a glance #
| Property | MessagePack |
|---|---|
| Layout | Binary, value-by-value (each value self-tagged) |
| Data model | Same as JSON: maps, arrays, strings, ints, floats, bool, null |
| Schema | None embedded — the reader must know the types |
| Compactness | Smaller than JSON; no field-name or whitespace overhead per value |
| Typical use | Wire protocols, caches (Redis, MongoDB), IoT, RPC, message queues |
| Open standard | Yes; implementations in 50+ languages |
How it is laid out #
JSON is text. The number 255 is three ASCII characters; true is four; every object repeats its field names in full, surrounded by quotes, braces and commas. MessagePack throws all of that away. Every value is a tag byte followed by its payload:
- A small integer like 5 is a single byte. The tag is the value.
- 255 is two bytes: a uint8 tag, then the byte 0xFF.
- A short string is a one-byte length-tag, then the raw UTF-8 bytes, with no quotes and no escaping.
- A float64 is a one-byte tag (0xcb) followed by 8 IEEE-754 bytes.
Because the tag encodes both type and size, a parser never has to scan ahead for a closing quote or brace. It reads the tag, knows exactly how many bytes follow, and jumps to the next value. That is what makes MessagePack fast to decode and why it is popular on the wire and in caches.
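
To make the tag-plus-payload idea concrete, here is a minimal Python sketch that encodes the four cases from the list by hand, using only the standard library. The helper names (pack_uint, pack_str, pack_float64) are made up for this illustration and are not part of any MessagePack library; the sketch covers only small integers and short strings.

```python
import struct

def pack_uint(n: int) -> bytes:
    """Encode an integer in 0..255 (the only range this sketch covers)."""
    if not 0 <= n <= 0xFF:
        raise ValueError("sketch only covers 0..255")
    if n <= 0x7F:
        return bytes([n])            # positive fixint: the tag is the value
    return bytes([0xCC, n])          # uint8 tag, then one payload byte

def pack_str(s: str) -> bytes:
    """Encode a short string (at most 31 UTF-8 bytes) as a fixstr."""
    raw = s.encode("utf-8")
    if len(raw) > 31:
        raise ValueError("sketch only covers strings up to 31 bytes")
    return bytes([0xA0 | len(raw)]) + raw   # the tag byte carries the length

def pack_float64(x: float) -> bytes:
    """Encode a float as the 0xcb tag plus 8 big-endian IEEE-754 bytes."""
    return b"\xcb" + struct.pack(">d", x)

print(pack_uint(5).hex())          # 05
print(pack_uint(255).hex())        # ccff
print(pack_str("GB").hex())        # a24742
print(len(pack_float64(201.77)))   # 9
```

A real encoder also handles negative and wider integers, longer strings, arrays and maps; the point here is only that each value begins with a byte that tells the reader how much to consume next.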
Describing the layout is one thing; seeing it in a real file is more convincing. Let's write a .msgpack file with clickhouse local. It's part of ClickHouse, installed with clickhousectl, the ClickHouse CLI for local and cloud:

```
curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH
```

Generate a small demo file of event rows:

```
clickhouse local -q "
SELECT
    number AS id,
    ['GB','AU','IN','US','DE'][(number % 5) + 1] AS country,
    ['mobile','desktop','tablet'][(number % 3) + 1] AS device,
    ['view','click','purchase'][(number % 3) + 1] AS event_type,
    round(randUniform(1, 500), 2) AS revenue,
    toUInt8((number % 4) + 1) AS quantity
FROM numbers(20)
INTO OUTFILE 'events.msgpack'
FORMAT MsgPack"
```

Look at the bytes #
Dump the first rows with xxd and the structure is visible:

```
xxd events.msgpack | head -4
```

```
00000000: 00c4 0247 42c4 066d 6f62 696c 65c4 0476  ...GB..mobile..v
00000010: 6965 77cb 4069 38a3 d70a 3d71 0101 c402  iew.@i8...=q....
00000020: 4155 c407 6465 736b 746f 70c4 0563 6c69  AU..desktop..cli
00000030: 636b cb40 7d69 c28f 5c28 f602 02c4 0249  ck.@}i..\(.....I
```

Read it left to right. 00 is the integer 0 (the first id), in a single byte. c4 02 is a binary blob of length 2, holding 47 42 = GB. c4 06 is a 6-byte blob, mobile. cb is the float64 tag; the eight bytes after it are the first revenue. There are no field names anywhere. country, device and the rest are not stored in the file. The data is positional. That compactness is the point, and it is also the catch.
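
The same left-to-right walk can be written as a few lines of Python. This is a sketch, not a full MessagePack decoder: it handles only the three tags that appear in the hexdump above (positive fixint, bin8 and float64), and the names ROW0, read_value and read_all are invented for the example. ROW0 is the first row's 29 bytes copied from the dump.

```python
import struct

# First row copied from the hexdump: id, country, device, event_type, revenue, quantity.
ROW0 = bytes.fromhex(
    "00"                    # id
    "c4024742"              # country
    "c4066d6f62696c65"      # device
    "c40476696577"          # event_type
    "cb406938a3d70a3d71"    # revenue
    "01"                    # quantity
)

def read_value(buf: bytes, pos: int):
    """Decode one value starting at pos; return (value, next_pos)."""
    tag = buf[pos]
    if tag <= 0x7F:                          # positive fixint: tag is the value
        return tag, pos + 1
    if tag == 0xC4:                          # bin8: one length byte, then payload
        n = buf[pos + 1]
        return buf[pos + 2 : pos + 2 + n], pos + 2 + n
    if tag == 0xCB:                          # float64: 8 big-endian IEEE-754 bytes
        return struct.unpack(">d", buf[pos + 1 : pos + 9])[0], pos + 9
    raise ValueError(f"tag 0x{tag:02x} is not handled in this sketch")

def read_all(buf: bytes) -> list:
    """Decode values back to back until the buffer is exhausted."""
    pos, out = 0, []
    while pos < len(buf):
        value, pos = read_value(buf, pos)
        out.append(value)
    return out

print(read_all(ROW0))   # [0, b'GB', b'mobile', b'view', 201.77, 1]
```

The decoder never searches for a delimiter: every branch knows from the tag (plus, for bin8, one length byte) exactly where the next value starts.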
The catch: no schema #
MessagePack does not record what its values mean. A reader sees "blob, blob, blob, float, int" but not the column names or even which value is which field. So when you point clickhouse local at a .msgpack file with no structure, it refuses:

```
clickhouse local -q "SELECT * FROM file('events.msgpack', MsgPack) LIMIT 1"
```

```
Code: 636. DB::Exception: The table structure cannot be extracted from a MsgPack format file.
Code: 36. DB::Exception: You must specify setting input_format_msgpack_number_of_columns
to extract table schema from MsgPack data. (BAD_ARGUMENTS)
```

This is unlike Parquet or Avro, which embed their own schema and need no hint (see what is a Parquet file). With MessagePack you supply the column names and types yourself, in order:

```
clickhouse local -q "
SELECT *
FROM file('events.msgpack', MsgPack,
    'id UInt64, country String, device String, event_type String, revenue Float64, quantity UInt8')
LIMIT 5
FORMAT Pretty"
```

```
   ┏━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
   ┃ id ┃ country ┃ device  ┃ event_type ┃ revenue ┃ quantity ┃
   ┡━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
1. │  0 │ GB      │ mobile  │ view       │  201.77 │        1 │
2. │  1 │ AU      │ desktop │ click      │  470.61 │        2 │
3. │  2 │ IN      │ tablet  │ purchase   │  268.33 │        3 │
4. │  3 │ US      │ mobile  │ view       │   105.6 │        4 │
5. │  4 │ DE      │ desktop │ click      │     178 │        1 │
   └────┴─────────┴─────────┴────────────┴─────────┴──────────┘
```

The types must line up with what the writer put on the wire. Give it the wrong order or a wrong type and you get garbage or an error, because there is nothing in the file to check against. Knowing the schema out-of-band is the price MessagePack pays for being small.
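
The positional nature of the data can be illustrated without any file at all. The Python sketch below takes a flat list of already-decoded values (the first two rows of the table above) and labels them using a column list supplied from outside. The function to_rows and the variable names are hypothetical, and this shows the general idea only, not how ClickHouse implements its reader.

```python
COLUMNS = ["id", "country", "device", "event_type", "revenue", "quantity"]

# Flat stream of decoded values for the first two rows; no names, no row markers.
values = [0, "GB", "mobile", "view", 201.77, 1,
          1, "AU", "desktop", "click", 470.61, 2]

def to_rows(flat, columns):
    """Cut a flat value stream into rows of len(columns) and label each position."""
    width = len(columns)
    if len(flat) % width != 0:
        raise ValueError("value count is not a multiple of the column count")
    return [dict(zip(columns, flat[i:i + width]))
            for i in range(0, len(flat), width)]

rows = to_rows(values, COLUMNS)
print(rows[1]["country"])    # AU

# Swap two names and nothing complains: the values are simply mislabeled.
wrong = to_rows(values, ["id", "device", "country", "event_type", "revenue", "quantity"])
print(wrong[0]["device"])    # GB
```

The only check available to the reader is structural (does the value count divide evenly into rows), which is why the column count and order have to come from somewhere other than the file.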
Size on disk #
Those same 20 rows, written once as MessagePack and once as line-delimited JSON:

```
events.msgpack   618 bytes
events.jsonl    1883 bytes
```

About a third of the size, and that gap is mostly the field names ("country":, "event_type": and so on) that JSON repeats on every row and MessagePack omits entirely. On wide, repetitive records the saving is larger. For columnar analytics on large datasets a format like Parquet compresses better still, but MessagePack wins where you need a quick, schema-light binary encoding of one record at a time, which is exactly where it shows up.
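
The per-row arithmetic behind that ratio can be checked on a single record. The Python sketch below encodes the first row positionally with the same tags seen in the hexdump (fixint, bin8, float64) and compares it with a compact JSON line. pack_row is a hypothetical helper written for this comparison, and the JSON line comes from Python's json module, so its exact length may differ slightly from what ClickHouse writes.

```python
import json
import struct

row = {"id": 0, "country": "GB", "device": "mobile",
       "event_type": "view", "revenue": 201.77, "quantity": 1}

def pack_row(r: dict) -> bytes:
    """Encode the values only, in order; field names are never written."""
    out = bytearray()
    for v in r.values():
        if isinstance(v, int):
            if not 0 <= v <= 0x7F:
                raise ValueError("sketch only covers integers 0..127")
            out.append(v)                              # positive fixint
        elif isinstance(v, str):
            raw = v.encode("utf-8")
            out += bytes([0xC4, len(raw)]) + raw       # bin8: tag, length, payload
        else:
            out += b"\xcb" + struct.pack(">d", v)      # float64
    return bytes(out)

packed = pack_row(row)
as_json = json.dumps(row, separators=(",", ":")) + "\n"
print(len(packed), len(as_json))   # 29 92
```

Most of the difference is the six quoted field names and their punctuation, which the positional encoding never writes.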
Where you'll meet it #
MessagePack turns up as a wire and storage format rather than a data-lake format: Redis modules, MongoDB-adjacent tooling, IoT and embedded telemetry, RPC frameworks, and message queues that want JSON's flexibility without JSON's byte count. If you also handle the BSON that MongoDB stores natively, that's a related but distinct binary-JSON format. See what is BSON.
Read one yourself #
You don't need a server or an import step. clickhouse local reads a .msgpack file in place from the command line, as long as you tell it the structure. To aggregate one — here over a 2,000,000-row file:

```
clickhouse local -q "
SELECT country, count() AS events, round(sum(revenue)) AS revenue
FROM file('events_large.msgpack', MsgPack,
    'id UInt64, country String, device String, event_type String, revenue Float64, quantity UInt8')
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC"
```

That scan-and-aggregate over 2,000,000 MessagePack rows runs in 0.276s (best of three, warm, on an Apple M4 Pro, 14 cores, 24 GB RAM; a number from one run, sensitive to concurrent load). The same SQL runs unchanged against a file on your laptop, a ClickHouse server, or ClickHouse Cloud, so a query you write against a local .msgpack file scales up without a rewrite.
For the full how-to, including casting and globbing many files, see read a MessagePack file, or to turn one into text see convert MessagePack to JSON.
Run it yourself: the data generator and every command above live in local-analytics/what-is-messagepack in the ClickHouse examples repo.
Prefer Python? → Read a MessagePack file in Python.