How to query a BSON file

Al Brown
Last updated: Jun 15, 2026

To query a BSON file, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the file with the file() function:

clickhouse local -q "SELECT * FROM file('events.bson') LIMIT 10"
0	alice	view	5	('GB',1)

BSON is MongoDB's binary JSON, the format mongoexport writes, and it carries its own field names and types. ClickHouse reads it as BSONEachRow and infers the schema straight from the file, so you query it immediately with no import step.

See the schema without declaring one

A BSON document records the name and type of every field, so you never write CREATE TABLE. DESCRIBE prints what ClickHouse read out of the file:

clickhouse local -q "DESCRIBE file('events.bson')"
_id      Nullable(Int64)
user     Nullable(String)
event    Nullable(String)
amount   Nullable(Float64)
geo      Tuple(country Nullable(String), sessions Nullable(Int32))

Scalars map to the obvious ClickHouse types. The interesting one is geo: it was a nested sub-document in the BSON, and it comes back as a Tuple with its keys preserved as named fields. Columns are Nullable because a document collection can always omit a field.

Query into nested sub-documents

MongoDB documents nest. That's the whole point of them, and it's where flattening a BSON export to CSV first loses you information. Here ClickHouse kept the structure, so you address a sub-document field by key with dot notation (geo.country), no unpacking step:

clickhouse local -q "
SELECT _id, user, event, geo.country AS country, geo.sessions AS sessions
FROM file('events.bson')
LIMIT 5
FORMAT PrettyCompact"
┌─_id─┬─user──┬─event────┬─country─┬─sessions─┐
1. │   0 │ alice │ view     │ GB      │        1 │
2. │   1 │ bob   │ click    │ US      │        2 │
3. │   2 │ carol │ purchase │ DE      │        3 │
4. │   3 │ dave  │ refund   │ FR      │        1 │
5. │   4 │ alice │ view     │ IN      │        2 │
   └─────┴───────┴──────────┴─────────┴──────────┘

A nested field reads just like a top-level one. (If a sub-document holds values that are all the same type, BSON serialises it as an array and ClickHouse infers Array(...) instead of a named Tuple; you'd then index it positionally as geo[1]. Mixed-type sub-documents keep their keys.)

Filter, group and aggregate directly on the BSON

The file is a SQL source, so the full ClickHouse dialect works on it (WHERE, GROUP BY, aggregates, joins) with no load step in between, and the nested key works inside the aggregation too:

clickhouse local -q "
SELECT geo.country AS country,
       count()              AS events,
       round(sum(amount), 2) AS amount
FROM file('events.bson')
WHERE event = 'purchase'
GROUP BY country
ORDER BY amount DESC
FORMAT PrettyCompact"
┌─country─┬─events─┬─amount─┐
1. │ FR      │      1 │  23.18 │
2. │ IN      │      1 │  19.14 │
3. │ GB      │      1 │   15.1 │
4. │ US      │      1 │  11.06 │
5. │ DE      │      1 │   7.02 │
   └─────────┴────────┴────────┘

Convert BSON to a columnar file once

BSON is a row format built for document storage and transport, not for repeated analytical scans: every query walks each document field by field. If you'll query the same export more than once, convert it to Parquet once and read that instead. It's a single command — SELECT from the BSON, INTO OUTFILE as Parquet:

clickhouse local -q "SELECT * FROM file('events.bson') INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"

The nested geo tuple maps cleanly to a Parquet struct, so the structure survives the round-trip. Read it back with the same file() call. See how to query a Parquet file and the convert BSON to JSON guide for other targets.

How fast is it?

The conversion you skipped was the slow part. On a 1,300,000-row, ~140 MB BSON file (events_large.bson), a filter-and-group-by over every document completes in:

clickhouse local --time -q "
SELECT geo.country, count(), round(sum(amount), 2)
FROM file('events_large.bson')
WHERE event = 'purchase'
GROUP BY geo.country ORDER BY 3 DESC
FORMAT Null"
0.138

~0.14 seconds of query execution (--time prints it), best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That decodes every BSON document from scratch on each run; there's no cached table. Measured end to end including process startup, run.sh reports wall-clock at about 0.28s. Convert the same data to Parquet and the scan gets faster again, because Parquet is columnar and reads only the fields a query touches.

The same SQL scales unchanged

The query you just ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Nothing about SELECT ... WHERE ... GROUP BY changes. You swap file('events.bson') for a table name and the rest stays put. Prototype against a mongoexport dump on your machine, ship the identical logic to production.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo. It has generate.sh (builds the demo document and the ~140 MB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-bson

git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-bson
./generate.sh && ./run.sh

Share this resource

  • Y Combinator icon
  • X icon
  • Bluesky icon
  • Facebook icon
  • LinkedIn icon

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!

More like this

Scale vector search in Postgres with pgvector: avoid HNSW RAM limits, fix filtering recall, and know when to go hybrid. Read now.

Continue reading ->

How to query a REST API in Python

Al Brown • Last updated: Jun 15, 2026

Read a JSON API response into a DataFrame with chDB. Use the pandas API you already know to filter and aggregate the response, running on ClickHouse's engine with no server to start.

Continue reading ->

How to convert Parquet to ORC

Al Brown • Last updated: Jun 6, 2026

Convert a Parquet file to ORC with one clickhouse-local command. The schema is read from the Parquet footer and the types carry into ORC, with no server and no upload.

Continue reading ->