A BSON file holds data as Binary JSON: a typed, length-prefixed binary encoding of JSON-style documents, created for MongoDB as its on-disk and wire format. Each document is a length-prefixed run of fields, and every field carries a one-byte type tag before its name and value. You can read one in place with clickhouse local:
```bash
clickhouse local -q "SELECT * FROM file('data.bson', BSONEachRow)"
```

At a glance

| Property | BSON |
|---|---|
| Full name | Binary JSON |
| Origin | MongoDB storage and wire format |
| Layout | Row-oriented: a stream of documents, one per record |
| Schema | Self-describing — each field carries a type tag |
| Encoding | Binary, length-prefixed (not text) |
| Types | Beyond JSON: int32, int64, double, datetime, binary, ObjectId, Decimal128 |
| Typical use | MongoDB exports, document stores, service interchange |
How BSON differs from JSON
JSON is text. Every number is decimal digits, every value is delimited by punctuation, and a parser has to scan the whole document to find where things end. BSON keeps the same document-and-field model but encodes it as bytes:
- Length-prefixed. A document starts with its own total size as a 4-byte little-endian integer. A reader knows exactly where the document ends without scanning to the closing brace, so it can skip whole documents cheaply.
- Typed. Each field begins with a one-byte type tag (0x01 double, 0x02 UTF-8 string, 0x10 int32, 0x12 int64, 0x08 boolean, 0x09 UTC datetime, and so on), then the field name as a NUL-terminated string, then the value. The type is on disk, so there is no guessing whether 42 is an integer or a float.
- Richer types than JSON. BSON adds datetimes, 64-bit integers, binary blobs, ObjectId, and Decimal128, all of which JSON has to fake with strings.
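
The length prefix and type tag described in the list above can be sketched with Python's standard struct module. This is a minimal illustration, not a full BSON library: the helper names encode_int32_doc and iter_documents are hypothetical, and the encoder only handles int32 (0x10) fields.

```python
import struct

def encode_int32_doc(fields):
    """Encode a {name: int} mapping as one BSON document of int32 (0x10) fields."""
    body = b""
    for name, value in fields.items():
        # one-byte type tag, NUL-terminated field name, 4-byte little-endian value
        body += b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)
    # the total length counts the 4-byte prefix itself and the trailing NUL terminator
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

def iter_documents(buf):
    """Yield (offset, length) per document, reading only the length prefixes."""
    offset = 0
    while offset < len(buf):
        (length,) = struct.unpack_from("<i", buf, offset)
        yield offset, length
        offset += length  # jump straight to the next document, no field parsing

stream = b"".join(encode_int32_doc({"n": i, "sq": i * i}) for i in range(3))
spans = list(iter_documents(stream))
print(spans)  # [(0, 20), (20, 20), (40, 20)]
```

The reader never looks at a field: it hops from one length prefix to the next, which is the cheap skipping the first bullet describes.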
The trade-off: BSON repeats every field name in every document, so for small scalar records it is often larger on disk than the equivalent JSON. Its win is fast, typed, skippable traversal, not compactness.
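
The size trade-off is easy to check by hand. The sketch below compares compact JSON with a hand-rolled BSON encoding for two tiny records. It assumes every value fits in an int32, and bson_int32_doc and sizes are illustrative helpers, not part of any library.

```python
import json
import struct

def bson_int32_doc(fields):
    """Encode a {name: int} mapping as a BSON document of int32 (0x10) fields."""
    body = b"".join(
        b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)
        for name, value in fields.items()
    )
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

def sizes(record):
    """Return (compact JSON size, BSON size) in bytes for the same record."""
    as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
    return len(as_json), len(bson_int32_doc(record))

small = sizes({"a": 1, "b": 2})     # single-digit values
large = sizes({"a": 2000000000})    # a ten-digit value
print(small)  # (13, 19): BSON is larger
print(large)  # (16, 12): BSON is smaller
```

Short values cost BSON a fixed 4 bytes each plus the length prefix and terminator, while JSON pays one byte per digit, so which format is smaller depends on the data.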
See the structure in a real file
Describing the layout is one thing. Reading it from a real .bson file is more convincing. clickhouse local can both write and read BSON, and it's part of ClickHouse. Install it with clickhousectl, the ClickHouse CLI for local and cloud:
```bash
curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH
```

Write five typed user documents to users.bson:
```bash
clickhouse local -q "
SELECT
    number AS user_id,
    ['alice','bob','carol','dave','erin'][(number % 5) + 1] AS name,
    ['GB','US','DE','FR','IN'][(number % 5) + 1] AS country,
    toDateTime('2026-01-01 00:00:00') + toIntervalHour(number) AS signup_time,
    round(randUniform(1, 1000), 2) AS balance,
    (number % 2 = 0) AS active
FROM numbers(5)
INTO OUTFILE 'users.bson'
FORMAT BSONEachRow"
```

Because BSON carries its own types, the read side needs no schema. SELECT * just works:
```bash
clickhouse local -q "SELECT * FROM file('users.bson', BSONEachRow) FORMAT Pretty"
```

```text
   ┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
   ┃ user_id ┃ name  ┃ country ┃ signup_time ┃ balance ┃ active ┃
   ┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
1. │       0 │ alice │ GB      │  1767225600 │  822.28 │      1 │
2. │       1 │ bob   │ US      │  1767229200 │  196.84 │      0 │
3. │       2 │ carol │ DE      │  1767232800 │  953.38 │      1 │
4. │       3 │ dave  │ FR      │  1767236400 │  350.99 │      0 │
5. │       4 │ erin  │ IN      │  1767240000 │  986.02 │      1 │
   └─────────┴───────┴─────────┴─────────────┴─────────┴────────┘
```

DESCRIBE shows the schema ClickHouse inferred from the type tags. There is no header to read and no sampling to do; the types are in the bytes:
```bash
clickhouse local -q "DESCRIBE file('users.bson', BSONEachRow)"
```

```text
user_id      Nullable(Int64)
name         Nullable(String)
country      Nullable(String)
signup_time  Nullable(Int64)
balance      Nullable(Float64)
active       Nullable(Int32)
```

Two honest round-trip notes. signup_time comes back as an Int64 of epoch seconds and active as an Int32, not as DateTime and Bool. Both are faithful to the bytes: in this file the two fields carry the int64 (0x12) and int32 (0x10) type tags rather than BSON's datetime (0x09) and boolean (0x08) tags. Cast them if you want richer ClickHouse types: toDateTime(signup_time) and active::Bool.
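
Outside ClickHouse, the same two conversions are a one-liner each. The sketch below uses the first row of the result table above and assumes the epoch seconds are meant as UTC. The variable names are illustrative.

```python
from datetime import datetime, timezone

# Values from the first row of the query result above.
signup_time = 1767225600  # epoch seconds, read back as Int64
active = 1                # flag, read back as Int32

# Interpret the integer as a UTC timestamp (assumption: the data is in UTC).
signup_dt = datetime.fromtimestamp(signup_time, tz=timezone.utc)
# Any non-zero integer becomes True.
is_active = bool(active)

print(signup_dt.isoformat(), is_active)  # 2026-01-01T00:00:00+00:00 True
```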
The bytes, up close
The structure is visible in a hex dump of the first document:
```bash
xxd users.bson | head -4
```

```text
00000000: 6800 0000 1275 7365 725f 6964 0000 0000  h....user_id....
00000010: 0000 0000 0005 6e61 6d65 0005 0000 0000  ......name......
00000020: 616c 6963 6505 636f 756e 7472 7900 0200  alice.country...
00000030: 0000 0047 4212 7369 676e 7570 5f74 696d  ...GB.signup_tim
```

Read it left to right: 68 00 00 00 is 0x68 = 104, the document's total length in bytes (little-endian). Then 12 is the type tag for int64, followed by the field name user_id and its 8-byte value. Next comes the name field, then country (you can see GB in the ASCII column), then signup_time. The length prefix and the per-field type tags are exactly what let a reader walk the document without parsing text.
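
The same walk can be reproduced in a few lines of Python over the 64 bytes shown in the dump. This is a sketch under stated assumptions: walk is a hypothetical helper that handles only the two tags appearing in this excerpt, 0x12 (int64) and 0x05 (BSON binary, which is how the name and country values are tagged in this particular dump).

```python
import struct

# First 64 bytes of users.bson, copied from the xxd output above.
DUMP = bytes.fromhex(
    "6800000012757365725f696400000000"
    "0000000000056e616d65000500000000"
    "616c69636505636f756e747279000200"
    "0000004742127369676e75705f74696d"
)

def walk(buf):
    """Return (total_length, [(tag, name, value), ...]) for the fields fully present."""
    (total,) = struct.unpack_from("<i", buf, 0)  # 4-byte little-endian length prefix
    fields, pos = [], 4
    while pos < len(buf) and buf[pos] != 0x00:   # 0x00 would be the document terminator
        tag = buf[pos]
        end = buf.find(b"\x00", pos + 1)         # field name is NUL-terminated
        if end == -1:
            break                                # name cut off by the 64-byte excerpt
        name = buf[pos + 1:end].decode("utf-8")
        pos = end + 1
        if tag == 0x12:                          # int64: 8 bytes
            (value,) = struct.unpack_from("<q", buf, pos)
            pos += 8
        elif tag == 0x05:                        # binary: int32 length, subtype byte, payload
            (n,) = struct.unpack_from("<i", buf, pos)
            value = buf[pos + 5:pos + 5 + n]
            pos += 5 + n
        else:
            break                                # other tags are outside this sketch
        fields.append((tag, name, value))
    return total, fields

total, fields = walk(DUMP)
print(total)   # 104
print(fields)  # [(18, 'user_id', 0), (5, 'name', b'alice'), (5, 'country', b'GB')]
```

The loop stops at signup_time because its name runs past the end of the excerpt; on the full file the same loop would continue until the terminating 0x00 byte.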
Query it like any table
Once read, a BSON file is just a table. Filter, group, and aggregate with ordinary SQL:
```bash
clickhouse local -q "
SELECT country, count() AS users, round(avg(balance), 2) AS avg_balance
FROM file('users.bson', BSONEachRow)
GROUP BY country
ORDER BY country
FORMAT Pretty"
```

```text
   ┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┓
   ┃ country ┃ users ┃ avg_balance ┃
   ┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━┩
1. │ DE      │     1 │      953.38 │
2. │ FR      │     1 │      350.99 │
3. │ GB      │     1 │      822.28 │
4. │ IN      │     1 │      986.02 │
5. │ US      │     1 │      196.84 │
   └─────────┴───────┴─────────────┘
```

The same query runs unchanged over a real MongoDB export: point file() at the .bson and go. On a 1,500,000-row, 111 MB BSON file, a GROUP BY over two columns finishes in 0.111s (best of three, warm cache, Apple M4 Pro, 24 GB RAM, macOS; may vary slightly under concurrent load). BSON is row-oriented, so the whole file is read and decoded for an aggregate. Unlike Parquet, it has no column statistics that would let the engine skip data.
When to use BSON, and when to convert
BSON earns its place as MongoDB's native format and as a wire format for services that already speak it. For repeated analytical querying, a columnar format is usually the better home: it's smaller, and an engine can read only the columns and row groups a query needs. A common pattern is to read the BSON once with clickhouse local and write Parquet for everything downstream.
Read one yourself
You don't need a server or a MongoDB instance to inspect a .bson file. clickhouse local reads it in place from the command line. The same SQL works whether you point it at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local BSON file scales up unchanged.
For a fuller walkthrough that covers globbing many exports, casting types on read, and joining BSON against other files, see how to query a BSON file.
Run it yourself: the data generator and every command above live in local-analytics/what-is-bson in the ClickHouse examples repo.
Prefer Python? → Read a BSON file in Python.