What is a BSON file?

A BSON file holds data as Binary JSON: a typed, length-prefixed binary encoding of JSON-style documents, created for MongoDB as its on-disk and wire format. Each document is a length-prefixed run of fields, and every field carries a one-byte type tag before its name and value. You can read one in place with clickhouse local:

clickhouse local -q "SELECT * FROM file('data.bson', BSONEachRow)"

At a glance

Property	BSON
Full name	Binary JSON
Origin	MongoDB storage and wire format
Layout	Row-oriented: a stream of documents, one per record
Schema	Self-describing — each field carries a type tag
Encoding	Binary, length-prefixed (not text)
Types	Beyond JSON: int32, int64, double, datetime, binary, ObjectId, Decimal128
Typical use	MongoDB exports, document stores, service interchange

How BSON differs from JSON

JSON is text. Every number is decimal digits, every value is delimited by punctuation, and a parser has to scan the whole document to find where things end. BSON keeps the same document-and-field model but encodes it as bytes:

Length-prefixed. A document starts with its own total size as a 4-byte little-endian integer. A reader knows exactly where the document ends without scanning to the closing brace, so it can skip whole documents cheaply.
Typed. Each field begins with a one-byte type tag (0x01 double, 0x02 UTF-8 string, 0x10 int32, 0x12 int64, 0x08 boolean, 0x09 UTC datetime, and so on), then the field name as a NUL-terminated string, then the value. The type is on disk, so there is no guessing whether 42 is an integer or a float.
Richer types than JSON. BSON adds datetimes, 64-bit integers, binary blobs, ObjectId, and Decimal128, all of which JSON has to fake with strings.
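
The three properties above can be made concrete with a minimal encoder sketch. It uses only Python's standard struct module and covers just five types (boolean, int32, int64, double, string); the function names encode_field and encode_document are illustrative, not part of any library. The test document {"hello": "world"} is the well-known example from the BSON specification.

```python
import struct

def encode_field(name, value):
    """Encode one field as: type tag, NUL-terminated name, value."""
    key = name.encode("utf-8") + b"\x00"
    if isinstance(value, bool):                 # check bool first: bool is an int subclass
        return b"\x08" + key + (b"\x01" if value else b"\x00")
    if isinstance(value, int):
        if -2**31 <= value < 2**31:
            return b"\x10" + key + struct.pack("<i", value)   # int32, little-endian
        return b"\x12" + key + struct.pack("<q", value)       # int64, little-endian
    if isinstance(value, float):
        return b"\x01" + key + struct.pack("<d", value)       # 8-byte double
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"                # string bytes plus NUL
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    raise TypeError(f"unsupported type: {type(value).__name__}")

def encode_document(doc):
    """Encode a flat dict: 4-byte total length, fields, 0x00 terminator."""
    body = b"".join(encode_field(k, v) for k, v in doc.items()) + b"\x00"
    return struct.pack("<i", 4 + len(body)) + body

encoded = encode_document({"hello": "world"})
print(encoded.hex(" "))
# 16 00 00 00 02 68 65 6c 6c 6f 00 06 00 00 00 77 6f 72 6c 64 00 00
```

The first four bytes (16 00 00 00) are the total length, 22, and the 02 that follows is the string tag. Encoding {"n": 42} and {"n": 42.0} with the same functions yields tags 0x10 and 0x01 respectively, which is the integer-versus-float distinction JSON cannot express.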

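The trade-off: like JSON, BSON repeats every field name in every document, and it adds a type tag per field, length prefixes, and fixed-width numbers (an int64 takes 8 bytes even when the value is 0). For small scalar records it is therefore often larger on disk than the equivalent JSON. Its win is fast, typed, skippable traversal, not compactness.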
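
The size claim can be checked with simple arithmetic. The sketch below adds up the bytes a small record would need in BSON and compares that with compact JSON. The record, the type assigned to each field, and the helper bson_size are assumptions chosen for illustration; other type choices would shift the numbers.

```python
import json

# Fixed payload widths in bytes for the scalar types used here.
FIXED = {"int32": 4, "int64": 8, "double": 8, "bool": 1}

def bson_size(fields):
    """Sum the encoded size of (name, bson_type, value) tuples."""
    size = 4 + 1                                      # length prefix + terminator
    for name, bson_type, value in fields:
        size += 1 + len(name.encode("utf-8")) + 1     # tag + name + NUL
        if bson_type == "string":
            size += 4 + len(value.encode("utf-8")) + 1  # length + bytes + NUL
        else:
            size += FIXED[bson_type]
    return size

# Hypothetical record with illustrative type choices.
fields = [
    ("user_id", "int64", 0),
    ("name", "string", "alice"),
    ("country", "string", "GB"),
    ("signup_time", "int64", 1767225600),
    ("balance", "double", 822.28),
    ("active", "int32", 1),
]

bson_bytes = bson_size(fields)
json_text = json.dumps({n: v for n, _, v in fields}, separators=(",", ":"))
json_bytes = len(json_text.encode("utf-8"))
print(bson_bytes, json_bytes)   # 104 96
```

Most of the gap comes from the fixed-width numbers: user_id costs 8 bytes of payload in BSON and a single character in JSON.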

See the structure in a real file

Describing the layout is one thing. Reading it from a real .bson file is more convincing. clickhouse local can both write and read BSON, and it's part of ClickHouse. Install it with clickhousectl, the ClickHouse CLI for local and cloud:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Write five typed user documents to users.bson:

clickhouse local -q "
SELECT
    number AS user_id,
    ['alice','bob','carol','dave','erin'][(number % 5) + 1] AS name,
    ['GB','US','DE','FR','IN'][(number % 5) + 1] AS country,
    toDateTime('2026-01-01 00:00:00') + toIntervalHour(number) AS signup_time,
    round(randUniform(1, 1000), 2) AS balance,
    (number % 2 = 0) AS active
FROM numbers(5)
INTO OUTFILE 'users.bson'
FORMAT BSONEachRow"

Because BSON carries its own types, the read side needs no schema. SELECT * just works:

clickhouse local -q "SELECT * FROM file('users.bson', BSONEachRow) FORMAT Pretty"

   ┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
   ┃ user_id ┃ name  ┃ country ┃ signup_time ┃ balance ┃ active ┃
   ┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
1. │       0 │ alice │ GB      │  1767225600 │  822.28 │      1 │
2. │       1 │ bob   │ US      │  1767229200 │  196.84 │      0 │
3. │       2 │ carol │ DE      │  1767232800 │  953.38 │      1 │
4. │       3 │ dave  │ FR      │  1767236400 │  350.99 │      0 │
5. │       4 │ erin  │ IN      │  1767240000 │  986.02 │      1 │
   └─────────┴───────┴─────────┴─────────────┴─────────┴────────┘

DESCRIBE shows the schema ClickHouse inferred from the type tags. There is no header to read and no sampling to do; the types are in the bytes:

clickhouse local -q "DESCRIBE file('users.bson', BSONEachRow)"

user_id      Nullable(Int64)
name         Nullable(String)
country      Nullable(String)
signup_time  Nullable(Int64)
balance      Nullable(Float64)
active       Nullable(Int32)

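Two honest round-trip notes. signup_time reads back as an Int64 of epoch seconds and active as an Int32, because that is how they were written: ClickHouse stored the DateTime column as a BSON int64 (tag 0x12) and the 0/1 result of number % 2 = 0 as an int32 (tag 0x10), not as BSON's datetime (0x09) or boolean (0x08). Both are faithful to the bytes; cast them if you want richer ClickHouse types: toDateTime(signup_time) and active::Bool.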
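
If the rows end up in Python rather than ClickHouse, the same two conversions are one-liners. This is a small sketch using the first row's values from the output above; it assumes the epoch value is in seconds and is interpreted as UTC.

```python
from datetime import datetime, timezone

signup_time = 1767225600   # first row's signup_time, epoch seconds
active = 1                 # first row's active flag, stored as an integer

signup = datetime.fromtimestamp(signup_time, tz=timezone.utc)
print(signup.isoformat())  # 2026-01-01T00:00:00+00:00
print(bool(active))        # True
```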

The bytes, up close

The structure is visible in a hex dump of the first document:

xxd users.bson | head -4

00000000: 6800 0000 1275 7365 725f 6964 0000 0000  h....user_id....
00000010: 0000 0000 0005 6e61 6d65 0005 0000 0000  ......name......
00000020: 616c 6963 6505 636f 756e 7472 7900 0200  alice.country...
00000030: 0000 0047 4212 7369 676e 7570 5f74 696d  ...GB.signup_tim

Read it left to right: 68 00 00 00 is 0x68 = 104, the document's total length in bytes (little-endian). Then 12 is the type tag for int64, followed by the field name user_id, a NUL terminator, and its 8-byte value. Next comes the name field: here the tag is 05 (binary) rather than 02 (string), so the value is a 4-byte length, a one-byte subtype, and the bytes of alice. country follows the same pattern (you can see GB in the ASCII column), then signup_time, again tagged 12. The length prefix and the per-field type tags are exactly what let a reader walk the document without parsing text.
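
The same walk can be done in code. The sketch below copies the 64 bytes shown in the hex dump and steps through them with Python's struct module. It handles only the two tags that appear in those bytes (0x12 int64 and 0x05 binary) and stops where the dump is cut off; the function name walk is illustrative.

```python
import struct

# First 64 bytes of users.bson, copied from the xxd output above.
DUMP = bytes.fromhex(
    "6800000012757365725f696400000000"
    "0000000000056e616d65000500000000"
    "616c69636505636f756e747279000200"
    "0000004742127369676e75705f74696d"
)

def walk(buf):
    """Return (declared_length, fields) for every field that fits in buf."""
    (total,) = struct.unpack_from("<i", buf, 0)      # 4-byte little-endian length
    fields = []
    pos = 4
    while pos < len(buf) and buf[pos] != 0x00:       # 0x00 would end the document
        tag = buf[pos]
        name_end = buf.find(b"\x00", pos + 1)
        if name_end == -1:
            break                                    # field name cut off by the dump
        name = buf[pos + 1:name_end].decode("utf-8")
        pos = name_end + 1
        if tag == 0x12:                              # int64: 8 bytes
            if pos + 8 > len(buf):
                break
            (value,) = struct.unpack_from("<q", buf, pos)
            pos += 8
        elif tag == 0x05:                            # binary: length, subtype, bytes
            if pos + 5 > len(buf):
                break
            (size,) = struct.unpack_from("<i", buf, pos)
            if pos + 5 + size > len(buf):
                break
            value = buf[pos + 5:pos + 5 + size]
            pos += 5 + size
        else:
            break                                    # other tags are out of scope here
        fields.append((tag, name, value))
    return total, fields

total, fields = walk(DUMP)
print(total)                                         # 104
for tag, name, value in fields:
    print(f"0x{tag:02x} {name} = {value!r}")
# 0x12 user_id = 0
# 0x05 name = b'alice'
# 0x05 country = b'GB'
```

The walk ends at signup_time because head -4 truncated the dump in the middle of that field's name; on the full file the loop would continue to the terminating 0x00 at byte 103.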

Query it like any table

Once read, a BSON file is just a table. Filter, group, and aggregate with ordinary SQL:

clickhouse local -q "
SELECT country, count() AS users, round(avg(balance), 2) AS avg_balance
FROM file('users.bson', BSONEachRow)
GROUP BY country
ORDER BY country
FORMAT Pretty"

   ┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┓
   ┃ country ┃ users ┃ avg_balance ┃
   ┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━┩
1. │ DE      │     1 │      953.38 │
2. │ FR      │     1 │      350.99 │
3. │ GB      │     1 │      822.28 │
4. │ IN      │     1 │      986.02 │
5. │ US      │     1 │      196.84 │
   └─────────┴───────┴─────────────┘

The same query runs over a real MongoDB export the same way: point file() at the .bson and go. On a 1,500,000-row, 111 MB BSON file, a GROUP BY over two columns finishes in 0.111s (best of three, warm cache, Apple M4 Pro, 24 GB RAM, macOS; may vary slightly under concurrent load). BSON is row-oriented, so the whole file is read and decoded for an aggregate. Unlike Parquet, it has no column statistics that would let a reader skip data.
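
What a row-oriented stream does allow is skipping whole documents. The sketch below counts documents by reading each 4-byte length prefix and seeking past the body without decoding any field. It builds its own small stream in memory; make_doc and count_documents are illustrative helpers, and the one-field documents are an assumption chosen to keep the example short.

```python
import io
import struct

def make_doc(n):
    """Build a one-field document {"n": <int32>} by hand."""
    body = b"\x10" + b"n\x00" + struct.pack("<i", n) + b"\x00"
    return struct.pack("<i", 4 + len(body)) + body

def count_documents(stream):
    """Count documents by hopping from one length prefix to the next."""
    count = 0
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            break                                  # end of stream
        (size,) = struct.unpack("<i", prefix)
        if size < 5:
            raise ValueError(f"invalid document length: {size}")
        stream.seek(size - 4, io.SEEK_CUR)         # skip the fields undecoded
        count += 1
    return count

stream = io.BytesIO(b"".join(make_doc(i) for i in range(1000)))
print(count_documents(stream))   # 1000
```

Counting rows is cheap this way, but reading a single column is not: the value sits at a different offset inside every document, so a reader still has to walk each one.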

When to use BSON, and when to convert

BSON earns its place as MongoDB's native format and as a wire format for services that already speak it. For repeated analytical querying, a columnar format is usually the better home: it's smaller, and an engine can read only the columns and row groups a query needs. A common pattern is to read the BSON once with clickhouse local and write Parquet for everything downstream.

Read one yourself

You don't need a server or a MongoDB instance to inspect a .bson file. clickhouse local reads it in place from the command line. The same SQL works whether you point it at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local BSON file scales up unchanged.

For a fuller walkthrough that covers globbing many exports, casting types on read, and joining BSON against other files, see how to query a BSON file.

Run it yourself: the data generator and every command above live in local-analytics/what-is-bson in the ClickHouse examples repo.

Prefer Python? → Read a BSON file in Python.
