Protobuf (Protocol Buffers) is an open, schema-first binary serialization format from Google for structured data. You define your message types once in a .proto file, then encode records as compact binary. The schema is not stored in the data, so you need the matching .proto to read it back. To read one with SQL: clickhouse local -q "SELECT * FROM file('events.bin', Protobuf) SETTINGS format_schema='events.proto:Event'".
| Property | Protobuf |
|---|---|
| Layout | Row-oriented binary (one encoded message per record) |
| Schema | External .proto file; not embedded in the data |
| Identity on the wire | Field numbers, not field names |
| Compression | None built in (pair it with gzip/zstd at the transport or file layer) |
| Typical use | RPC payloads (gRPC), service-to-service messages, event streams |
| Open standard | Open source under a BSD 3-Clause license; libraries for most languages |
Protobuf starts with a schema. You write a .proto file describing one or more messages, where each message is a set of typed fields and each field has a number:
Those numbers (= 1, = 2) are the part that matters. On the wire, Protobuf does not write field names. It writes, for each field, a small tag combining the field number and a wire type, followed by the value. Integers use a variable-length encoding (varint), so values below 128 cost one byte. Strings and nested messages are length-prefixed. The result is compact and fast to parse, but opaque: without the .proto, a Protobuf payload is just a run of tagged bytes with no self-describing column names.
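To make the tag-plus-value layout concrete, here is a minimal Python sketch of the encoding rules just described. It is an illustration, not the official Protobuf library: the helper names (encode_varint, encode_tag, encode_string_field, encode_uint_field) are hypothetical, and the field numbers and values are chosen only for the example.

```python
WIRE_VARINT = 0  # wire type for varint-encoded integers
WIRE_LEN = 2     # wire type for length-prefixed values (strings, nested messages)


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A tag is the field number shifted left by 3, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Tag, then byte length, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    return encode_tag(field_number, WIRE_LEN) + encode_varint(len(payload)) + payload


def encode_uint_field(field_number: int, value: int) -> bytes:
    """Tag, then the value as a varint."""
    return encode_tag(field_number, WIRE_VARINT) + encode_varint(value)


print(encode_varint(1).hex())                 # 01
print(encode_varint(300).hex())               # ac02
print(encode_string_field(2, "GB").hex())     # 12024742
print(encode_uint_field(1, 150).hex())        # 089601
```

Note that the field name never appears in the output: a string in field 2 is encoded identically whatever the .proto calls that field.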
That design is the whole trade-off. Formats like Parquet and Avro carry their schema with the data, so a reader can open them cold. Protobuf moves the schema out of band to keep messages tiny, which suits RPC and streaming where both ends already share the contract.
Prove it: write and read a real Protobuf file
clickhouse local speaks Protobuf in both directions. It's part of ClickHouse; install it with clickhousectl, the ClickHouse CLI for local and cloud:
Save the .proto above as events.proto, then write 20 rows to a binary file. The format_schema setting takes <file>:<MessageName>:
Now the schema requirement, made concrete. Read the file without the schema and ClickHouse refuses, because there is nothing in the bytes to describe them:
Supply the same .proto and the bytes become rows:
ClickHouse maps each .proto field to a column by name and derives the column types from the field types. DESCRIBE shows the schema it built from events.proto:
proto3 uint64 becomes UInt64, double becomes Float64, and so on. From here it's a normal table you can filter and aggregate:
Read the raw first bytes and you can see the framing with no field names in sight:
The leading 1D is a length prefix: 29 bytes for the first message. 1202 4742 is field 2 (country), length 2, the ASCII GB. 1A06 6D6F62696C65 is field 3 (device), length 6, mobile. The reader walks tags and lengths; the meaning of "field 2" comes entirely from the .proto. Lose the schema and you keep the bytes but lose the labels, which is exactly why the schema is mandatory.
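The walk described above can be sketched in a few lines of Python. This is a hedged illustration rather than a full decoder: read_varint and walk_fields are hypothetical helper names, and the input is only the byte fragments quoted in the paragraph, not the complete 29-byte message.

```python
from __future__ import annotations


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def walk_fields(message: bytes) -> list[tuple[int, int, object]]:
    """Split a message body into (field number, wire type, raw value) triples."""
    fields = []
    pos = 0
    while pos < len(message):
        tag, pos = read_varint(message, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(message, pos)
        elif wire_type == 1:  # fixed 64-bit (e.g. double)
            value, pos = message[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-prefixed (string, bytes, nested message)
            length, pos = read_varint(message, pos)
            value, pos = message[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit (e.g. float)
            value, pos = message[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# The leading length prefix from the dump: 0x1D is the varint for 29.
print(read_varint(bytes.fromhex("1D"), 0)[0])  # 29

# The two field fragments quoted in the text, concatenated.
fragment = bytes.fromhex("12024742" "1A066D6F62696C65")
for number, wire_type, value in walk_fields(fragment):
    print(number, wire_type, value)
# 2 2 b'GB'
# 3 2 b'mobile'
```

The decoder recovers field numbers and raw values, but nothing in the bytes says that field 2 is called country or that field 3 is device; those labels exist only in the .proto.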
Protobuf is built for fast decode, and reading it with SQL stays fast. On a 3,000,000-row Protobuf file (104 MB), clickhouse local scans, decodes, filters and groups by country in 0.33s (best of three, warm, Apple M4 Pro, 14 cores, 24 GB RAM; may be affected by concurrent load). Decoding still happens row by row, so a columnar file like Parquet will usually beat it for repeated analytical scans. For moving structured records between services, Protobuf's compactness is the point.
You don't need a server or a build step to read Protobuf. clickhouse local reads it in place from the command line, as long as you hand it the .proto. Because the schema is external, keep the .proto next to (or versioned with) the data, and pass it via SETTINGS format_schema = '<file>:<MessageName>'.
The same SQL works whether you point ClickHouse at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local Protobuf file scales up unchanged.
If you're comparing binary formats, two close relatives are self-describing (so they read cold) and are worth a look: Avro, whose container files embed the schema, and BSON, which stores field names and type tags in every document.
Run it yourself: the .proto, data generator and every command above live in local-analytics/what-is-protobuf in the ClickHouse examples repo.