How to query a BSON file

To query a BSON file, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the file with the file() function:

1clickhouse local -q "SELECT * FROM file('events.bson') LIMIT 10"

10	alice	view	5	('GB',1)

BSON is MongoDB's binary JSON, the format mongoexport writes, and it carries its own field names and types. ClickHouse reads it as BSONEachRow and infers the schema straight from the file, so you query it immediately with no import step.

See the schema without declaring one

A BSON document records the name and type of every field, so you never write CREATE TABLE. DESCRIBE prints what ClickHouse read out of the file:

1clickhouse local -q "DESCRIBE file('events.bson')"

1_id      Nullable(Int64)
2user     Nullable(String)
3event    Nullable(String)
4amount   Nullable(Float64)
5geo      Tuple(country Nullable(String), sessions Nullable(Int32))

Scalars map to the obvious ClickHouse types. The interesting one is geo: it was a nested sub-document in the BSON, and it comes back as a Tuple with its keys preserved as named fields. Columns are Nullable because a document collection can always omit a field.

Query into nested sub-documents

MongoDB documents nest. That's the whole point of them, and it's where flattening a BSON export to CSV first loses you information. Here ClickHouse kept the structure, so you address a sub-document field by key with dot notation (geo.country), no unpacking step:

1clickhouse local -q "
2SELECT _id, user, event, geo.country AS country, geo.sessions AS sessions
3FROM file('events.bson')
4LIMIT 5
5FORMAT PrettyCompact"

1┌─_id─┬─user──┬─event────┬─country─┬─sessions─┐
21. │   0 │ alice │ view     │ GB      │        1 │
32. │   1 │ bob   │ click    │ US      │        2 │
43. │   2 │ carol │ purchase │ DE      │        3 │
54. │   3 │ dave  │ refund   │ FR      │        1 │
65. │   4 │ alice │ view     │ IN      │        2 │
7   └─────┴───────┴──────────┴─────────┴──────────┘

A nested field reads just like a top-level one. (If a sub-document holds values that are all the same type, BSON serialises it as an array and ClickHouse infers Array(...) instead of a named Tuple; you'd then index it positionally as geo[1]. Mixed-type sub-documents keep their keys.)

The file is a SQL source, so the full ClickHouse dialect works on it (WHERE, GROUP BY, aggregates, joins) with no load step in between, and the nested key works inside the aggregation too:

1clickhouse local -q "
2SELECT geo.country AS country,
3       count()              AS events,
4       round(sum(amount), 2) AS amount
5FROM file('events.bson')
6WHERE event = 'purchase'
7GROUP BY country
8ORDER BY amount DESC
9FORMAT PrettyCompact"

1┌─country─┬─events─┬─amount─┐
21. │ FR      │      1 │  23.18 │
32. │ IN      │      1 │  19.14 │
43. │ GB      │      1 │   15.1 │
54. │ US      │      1 │  11.06 │
65. │ DE      │      1 │   7.02 │
7   └─────────┴────────┴────────┘

Convert BSON to a columnar file once

BSON is a row format built for document storage and transport, not for repeated analytical scans: every query walks each document field by field. If you'll query the same export more than once, convert it to Parquet once and read that instead. It's a single command — SELECT from the BSON, INTO OUTFILE as Parquet:

1clickhouse local -q "SELECT * FROM file('events.bson') INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"

The nested geo tuple maps cleanly to a Parquet struct, so the structure survives the round-trip. Read it back with the same file() call. See how to query a Parquet file and the convert BSON to JSON guide for other targets.

How fast is it?

The conversion you skipped was the slow part. On a 1,300,000-row, ~140 MB BSON file (events_large.bson), a filter-and-group-by over every document completes in:

1clickhouse local --time -q "
2SELECT geo.country, count(), round(sum(amount), 2)
3FROM file('events_large.bson')
4WHERE event = 'purchase'
5GROUP BY geo.country ORDER BY 3 DESC
6FORMAT Null"

10.138

~0.14 seconds of query execution (--time prints it), best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That decodes every BSON document from scratch on each run; there's no cached table. Measured end to end including process startup, run.sh reports wall-clock at about 0.28s. Convert the same data to Parquet and the scan gets faster again, because Parquet is columnar and reads only the fields a query touches.

The same SQL scales unchanged

The query you just ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Nothing about SELECT ... WHERE ... GROUP BY changes. You swap file('events.bson') for a table name and the rest stays put. Prototype against a mongoexport dump on your machine, ship the identical logic to production.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo. It has generate.sh (builds the demo document and the ~140 MB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-bson

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-bson
3./generate.sh && ./run.sh

See the schema without declaring one

Query into nested sub-documents

Filter, group and aggregate directly on the BSON

Convert BSON to a columnar file once

How fast is it?

The same SQL scales unchanged

Run it yourself

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

How to query a BSON file

See the schema without declaring one #

Query into nested sub-documents #

Filter, group and aggregate directly on the BSON #

Convert BSON to a columnar file once #

How fast is it? #

The same SQL scales unchanged #

Run it yourself #

Related #

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

See the schema without declaring one

Query into nested sub-documents

Filter, group and aggregate directly on the BSON

Convert BSON to a columnar file once

How fast is it?

The same SQL scales unchanged

Run it yourself

Related