To read a MessagePack file from the command line, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then pass the format and the column types, and query the file directly:
clickhouse local -q "SELECT * FROM file('events.msgpack', MsgPack,
'id UInt64, country String, event_type String, revenue Float64, quantity UInt8') LIMIT 5"
┌─id─┬─country─┬─event_type─┬─revenue─┬─quantity─┐
1. │ 1 │ GB │ click │ 5 │ 1 │
2. │ 2 │ US │ view │ 6.01 │ 2 │
3. │ 3 │ DE │ purchase │ 7.02 │ 3 │
4. │ 4 │ FR │ refund │ 8.03 │ 4 │
5. │ 5 │ IN │ click │ 9.04 │ 5 │
└────┴─────────┴────────────┴─────────┴──────────┘
MessagePack carries no column names or type information, so you supply the structure yourself in the file() call. The file is read in place with no import step.
Formats like Parquet, Avro or ORC embed their column names and types, so ClickHouse can read them with a bare file('data.parquet'). MessagePack does not. A .msgpack file is a flat stream of typed values with nothing that says which value is which column. (What is MessagePack?)
So a read without a structure fails, on purpose:
clickhouse local -q "SELECT * FROM file('events.msgpack') LIMIT 3"
Code: 636. DB::Exception: The table structure cannot be extracted from a MsgPack format file.
You must specify setting input_format_msgpack_number_of_columns to extract table schema from MsgPack data.
You can specify the structure manually. (CANNOT_EXTRACT_TABLE_STRUCTURE)
The fix is the three-argument form of file(): path, format, structure. You supply the column names and types in declaration order, and they must match how the file was written:
clickhouse local -q "
SELECT * FROM file('events.msgpack', MsgPack,
'id UInt64, country String, event_type String, revenue Float64, quantity UInt8')
LIMIT 5
FORMAT PrettyCompact"
That structure is the whole contract. Get the order or the types right and every value lands in the right column; get them wrong and you get garbage or a parse error, because there is no header to fall back on. If you only know the column count and not the names, set input_format_msgpack_number_of_columns and let ClickHouse name them c1, c2, ... instead.
Filter, aggregate, and group by
Once the structure is declared, the file behaves like any other SQL source. You have the full ClickHouse dialect (WHERE, GROUP BY, aggregate and window functions, joins) with no load step:
clickhouse local -q "
SELECT country,
count() AS purchases,
round(sum(revenue), 2) AS revenue,
round(avg(quantity), 3) AS avg_qty
FROM file('events.msgpack', MsgPack,
'id UInt64, country String, event_type String, revenue Float64, quantity UInt8')
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC
FORMAT PrettyCompact"
┌─country─┬─purchases─┬─revenue─┬─avg_qty─┐
1. │ GB │ 2 │ 34.24 │ 3 │
2. │ DE │ 2 │ 26.16 │ 4 │
3. │ IN │ 1 │ 15.1 │ 1 │
└─────────┴───────────┴─────────┴─────────┘
Retyping that structure on every query gets old fast. Read the file once, write it to a self-describing format, and query that from then on. Converting to Parquet is a single command (SELECT ... INTO OUTFILE 'events.parquet' FORMAT Parquet), after which file('events.parquet') needs no structure at all. See how to query a Parquet file.
You do not unzip anything first. clickhouse local reads the .gz suffix and decompresses on the fly, so a .msgpack.gz file is queried exactly like a .msgpack (the structure argument is still required, since compression does not add a schema):
clickhouse local -q "
SELECT count(), round(sum(revenue), 2)
FROM file('events.msgpack.gz', MsgPack,
'id UInt64, country String, event_type String, revenue Float64, quantity UInt8')
FORMAT PrettyCompact"
┌─count()─┬─round(sum(revenue), 2)─┐
1. │ 20 │ 291.9 │
└─────────┴────────────────────────┘
The same applies to .msgpack.zst, .msgpack.lz4 and other supported codecs. The compression is inferred from the file name.
MessagePack is a compact binary format, so there is no text to parse and no type to guess at read time. On a 3,000,000-row, ~80 MB file (events_large.msgpack), a filter-and-group-by over every row runs in:
clickhouse local --time -q "
SELECT country, count(), round(sum(revenue), 2), round(avg(quantity), 3)
FROM file('events_large.msgpack', MsgPack,
'id UInt64, country String, event_type String, revenue Float64, quantity UInt8')
WHERE event_type = 'purchase'
GROUP BY country ORDER BY 2 DESC
FORMAT Null"
~0.47 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). The decode runs across all cores, so a multi-million-row aggregation finishes before you can switch windows.
The query you just ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Nothing about SELECT ... WHERE ... GROUP BY changes; you swap the file() call for a table name and the rest stays put. You prototype against a MessagePack file on your machine and ship the identical logic to production.
The complete, runnable example lives here. It has generate.sh (builds the demo file, a gzipped copy, and the 3M-row perf file), run.sh (every command above), and expected_output.txt:
github.com/ClickHouse/examples → local-analytics/clickhouse-local-messagepack
git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-messagepack
./generate.sh && ./run.sh