What is Protobuf?

Al Brown
Last updated: Jun 15, 2026

Protobuf (Protocol Buffers) is an open, schema-first binary serialization format from Google for structured data. You define your message types once in a .proto file, then encode records as compact binary. The schema is not stored in the data, so you need the matching .proto to read it back. To read one with SQL: clickhouse local -q "SELECT * FROM file('events.bin', Protobuf) SETTINGS format_schema='events.proto:Event'".

At a glance

| Property | Protobuf |
| --- | --- |
| Layout | Row-oriented binary (one encoded message per record) |
| Schema | External .proto file; not embedded in the data |
| Identity on the wire | Field numbers, not field names |
| Compression | None built in (pair it with gzip/zstd at the transport or file layer) |
| Typical use | RPC payloads (gRPC), service-to-service messages, event streams |
| Open standard | BSD-licensed (3-clause); libraries for most languages |

How Protobuf is built

Protobuf starts with a schema. You write a .proto file describing one or more messages, where each message is a set of typed fields and each field has a number:

syntax = "proto3";

message Event {
  uint64 id         = 1;
  string country    = 2;
  string device     = 3;
  string event_type = 4;
  double revenue    = 5;
  uint32 quantity   = 6;
}

Those numbers (= 1, = 2) are the part that matters. On the wire, Protobuf does not write field names. It writes, for each field, a small tag combining the field number and a wire type, followed by the value. Integers use a variable-length encoding (varint), so values below 128 take a single byte after the tag. Strings and nested messages are length-prefixed. The result is compact and fast to parse, but opaque: without the .proto, a Protobuf payload is just a run of tagged bytes with no self-describing column names.
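To make the tag-plus-value layout concrete, here is a minimal Python sketch that hand-encodes a few fields of an Event using only the standard library. It is an illustration of the encoding rules, not how ClickHouse or the official Protobuf libraries are implemented; the helper names (encode_varint, tag, string_field and so on) are made up for this example, and the field values are arbitrary.

```python
import struct

# Wire types used by the Event schema (values from the Protobuf encoding rules).
VARINT, I64, LEN = 0, 1, 2


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def tag(field_number, wire_type):
    """A tag is the field number shifted left 3 bits, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def varint_field(field_number, value):
    return tag(field_number, VARINT) + encode_varint(value)


def string_field(field_number, text):
    data = text.encode("utf-8")
    return tag(field_number, LEN) + encode_varint(len(data)) + data


def double_field(field_number, value):
    return tag(field_number, I64) + struct.pack("<d", value)  # 8 bytes, little-endian


# Field numbers come from events.proto: id=1, country=2, revenue=5, quantity=6.
payload = (
    varint_field(1, 1)
    + string_field(2, "AU")
    + double_field(5, 4.69)
    + varint_field(6, 2)
)
print(payload.hex())  # 08011202415529c3f5285c8fc212403002
```

Seventeen bytes carry four fields, and the strings "id", "country", "revenue" and "quantity" appear nowhere in them.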

That design is the whole trade-off. Formats like Parquet and Avro carry their schema with the data, so a reader can open them cold. Protobuf moves the schema out of band to keep messages tiny, which suits RPC and streaming where both ends already share the contract.

Prove it: write and read a real Protobuf file

clickhouse local speaks Protobuf in both directions. It's part of ClickHouse; install it with clickhousectl, the ClickHouse CLI for local and cloud:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Save the .proto above as events.proto, then write 20 rows to a binary file. The format_schema setting takes <file>:<MessageName>:

clickhouse local -q "
SELECT
    number AS id,
    ['GB','AU','IN','US','DE'][(number % 5) + 1] AS country,
    ['mobile','desktop','tablet'][(number % 3) + 1] AS device,
    ['view','click','purchase'][(number % 3) + 1] AS event_type,
    round(randUniform(1, 500), 2) AS revenue,
    toUInt32((number % 4) + 1) AS quantity
FROM numbers(20)
INTO OUTFILE 'events.bin'
FORMAT Protobuf
SETTINGS format_schema = 'events.proto:Event'"

Now the schema requirement, made concrete. Read the file without the schema and ClickHouse refuses, because there is nothing in the bytes to describe them:

clickhouse local -q "SELECT count() FROM file('events.bin', Protobuf)"

Code: 36. DB::Exception: The format Protobuf requires a schema. The corresponding setting should be set. (BAD_ARGUMENTS)

Supply the same .proto and the bytes become rows:

clickhouse local -q "
SELECT *
FROM file('events.bin', Protobuf)
ORDER BY id
LIMIT 5
SETTINGS format_schema = 'events.proto:Event'
FORMAT Pretty"

   ┏━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
   ┃ id ┃ country ┃ device  ┃ event_type ┃ revenue ┃ quantity ┃
   ┡━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
1. │  0 │ GB      │ mobile  │ view       │    4.69 │        1 │
2. │  1 │ AU      │ desktop │ click      │   38.87 │        2 │
3. │  2 │ IN      │ tablet  │ purchase   │  136.51 │        3 │
4. │  3 │ US      │ mobile  │ view       │  462.63 │        4 │
5. │  4 │ DE      │ desktop │ click      │   250.3 │        1 │
   └────┴─────────┴─────────┴────────────┴─────────┴──────────┘

ClickHouse maps each .proto field to a column by name and derives the column types from the field types. DESCRIBE shows the schema it built from events.proto:

clickhouse local -q "
DESCRIBE file('events.bin', Protobuf)
SETTINGS format_schema = 'events.proto:Event'
FORMAT PrettyCompactNoEscapes"

   ┌─name───────┬─type────┐
1. │ id         │ UInt64  │
2. │ country    │ String  │
3. │ device     │ String  │
4. │ event_type │ String  │
5. │ revenue    │ Float64 │
6. │ quantity   │ UInt32  │
   └────────────┴─────────┘

proto3 uint64 becomes UInt64, double becomes Float64, and so on. From here it's a normal table you can filter and aggregate:

clickhouse local -q "
SELECT country, count() AS purchases, round(sum(revenue)) AS revenue
FROM file('events.bin', Protobuf)
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC
SETTINGS format_schema = 'events.proto:Event'
FORMAT Pretty"

   ┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┓
   ┃ country ┃ purchases ┃ revenue ┃
   ┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━┩
1. │ AU      │         1 │     473 │
2. │ DE      │         1 │     339 │
3. │ US      │         1 │     262 │
4. │ IN      │         2 │     203 │
5. │ GB      │         1 │      55 │
   └─────────┴───────────┴─────────┘

What the wire actually looks like

Read the raw first bytes and you can see the framing with no field names in sight:

clickhouse local -q "SELECT hex(substring(file('events.bin'), 1, 32))"

1D120247421A066D6F62696C6522047669657729C3F5285C8FC2124030012108

The leading 1D is a length prefix: 29 bytes for the first message. Field 1 (id) does not appear because the first row's id is 0, and proto3 leaves zero-valued fields off the wire. 1202 4742 is field 2 (country), length 2, the ASCII GB. 1A06 6D6F62696C65 is field 3 (device), length 6, mobile. The reader walks tags and lengths; the meaning of "field 2" comes entirely from the .proto. Lose the schema and you keep the bytes but lose the labels, which is exactly why the schema is mandatory.
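The same walk can be done in code. The sketch below is a schema-less reader in plain Python, fed the 32 bytes from the hex dump above. It is a teaching aid rather than a real Protobuf parser: the function names are invented for this example, it handles only the three wire types that Event uses, and it assumes every 8-byte value is a double because nothing in the bytes says so.

```python
import struct

# First 32 bytes of events.bin, copied from the hex dump above.
RAW = bytes.fromhex("1D120247421A066D6F62696C6522047669657729C3F5285C8FC2124030012108")


def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # no continuation bit: this was the last byte
            return value, pos
        shift += 7


def walk_message(buf):
    """Return (field_number, wire_type, raw_value) tuples without any schema."""
    fields, pos = [], 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # VARINT
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # I64: 8 bytes, read as a double by assumption
            (value,) = struct.unpack_from("<d", buf, pos)
            pos += 8
        elif wire_type == 2:  # LEN: varint length, then that many bytes
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields.append((field_number, wire_type, value))
    return fields


size, start = read_varint(RAW, 0)   # length prefix of the first message (29)
first = RAW[start:start + size]

# Field 1 (id) is absent here: the first row's id is 0, which proto3 does not write.
for field_number, wire_type, value in walk_message(first):
    print(field_number, wire_type, value)
```

The output is a list of field numbers, wire types and raw values (2 is b'GB', 5 is 4.69, and so on). Nothing in it says "country" or "revenue"; attaching those names is the job of events.proto.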

Performance

Protobuf is built for fast decode, and reading it with SQL stays fast. On a 3,000,000-row Protobuf file (104 MB), clickhouse local scans, decodes, filters and groups by country in 0.33s (best of three, warm, Apple M4 Pro, 14 cores, 24 GB RAM; may be affected by concurrent load). Decoding still happens row by row, so a columnar file like Parquet will usually beat it for repeated analytical scans. For moving structured records between services, Protobuf's compactness is the point.

Read one yourself

You don't need a server or a build step to read Protobuf. clickhouse local reads it in place from the command line, as long as you hand it the .proto. Because the schema is external, keep the .proto next to (or versioned with) the data, and pass it via SETTINGS format_schema = '<file>:<MessageName>'.
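If you would rather script the read than type it, the same pattern wraps easily. The following is a rough Python sketch, not something taken from the examples repo: count_rows_command is a hypothetical helper name, the file names are the ones used earlier in this article, and the query only runs if a clickhouse binary is found on your PATH.

```python
import shutil
import subprocess


def count_rows_command(data_file, schema_file, message_name):
    """Build the argv for a row count over a Protobuf file (hypothetical helper)."""
    schema = f"{schema_file}:{message_name}"  # <file>:<MessageName>
    sql = (
        f"SELECT count() FROM file('{data_file}', Protobuf) "
        f"SETTINGS format_schema = '{schema}'"
    )
    return ["clickhouse", "local", "-q", sql]


argv = count_rows_command("events.bin", "events.proto", "Event")
print(argv[-1])  # the SQL that will be sent to clickhouse local

# Only run the query when the clickhouse binary is actually installed.
if shutil.which("clickhouse"):
    result = subprocess.run(argv, capture_output=True, text=True)
    print(result.stdout or result.stderr)
```

Passing the arguments as a list avoids a layer of shell quoting around the SQL, which already contains single quotes.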

The same SQL works whether you point ClickHouse at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local Protobuf file scales up unchanged.

If you're comparing binary formats, two close relatives carry their structure with the data (so they read cold) and are worth a look: Avro and BSON.

Run it yourself: the .proto, data generator and every command above live in local-analytics/what-is-protobuf in the ClickHouse examples repo.
