Convert BSON to CSV

To convert BSON to CSV, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the file and write CSV out:

1clickhouse local -q "SELECT * FROM file('events.bson') INTO OUTFILE 'events.csv' FORMAT CSVWithNames"

1"event_id","event_type","geo","amount"
21,"signup","{'city':'London','country':'GB'}",5
32,"purchase","{'city':'Berlin','country':'DE'}",6.01
43,"login","{'city':'Paris','country':'FR'}",7.02

ClickHouse reads the BSON documents in place, infers the schema from the field names and types each document carries, and streams CSV out without an import step or an upload. Files larger than RAM convert the same way.

That works, but look at the geo column. It is a whole sub-document crammed into one CSV cell. BSON nests; CSV is flat. The interesting part of this conversion is what happens to nested fields, so let's handle that properly.

The schema is inferred from the BSON

BSON (Binary JSON) is the format MongoDB stores and mongoexport emits. Each document carries its own field names and types, so clickhouse-local can read the structure without you describing it. Check what it found with DESCRIBE:

1clickhouse local -q "DESCRIBE file('events.bson')"

1event_id	Nullable(Int64)
2event_type	Nullable(String)
3geo	Map(String, Nullable(String))
4amount	Nullable(Float64)

Scalar fields map cleanly: an integer becomes Int64, a string becomes String, a double becomes Float64. The nested geo sub-document becomes a Map(String, ...). That is correct and lossless inside ClickHouse, but CSV has no concept of a map. When you write it out, the whole map is serialized into a single quoted text cell, as you saw above.

Flatten nested sub-documents into real columns

A flat CSV with one value per column is almost always what you want downstream. Pull the keys you care about out of the map and project them as their own columns in the SELECT:

1clickhouse local -q "
2SELECT
3  event_id,
4  event_type,
5  geo['city']    AS city,
6  geo['country'] AS country,
7  amount
8FROM file('events.bson')
9INTO OUTFILE 'events_flat.csv' FORMAT CSVWithNames"

1"event_id","event_type","city","country","amount"
21,"signup","London","GB",5
32,"purchase","Berlin","DE",6.01
43,"login","Paris","FR",7.02
54,"refund","Austin","US",8.03
65,"signup","London","GB",9.04

city and country are now first-class columns. The flat file reads back with a clean, fully typed schema, ready for any tool that speaks CSV:

1event_id	Nullable(Int64)
2event_type	Nullable(String)
3city	Nullable(String)
4country	Nullable(String)
5amount	Nullable(Float64)

The same approach handles arrays inside a document: index them (tags[1]), join them into a string (arrayStringConcat(tags, '|')), or arrayJoin them to fan one document out into multiple CSV rows. You decide how the nesting collapses, instead of getting a serialized blob you have to re-parse later.

Options worth knowing

This is where a scriptable converter beats a drag-and-drop website. You control the output, not a fixed template.

FORMAT CSVWithNames gives comma-separated output with a header row. Drop the header with FORMAT CSV, switch to tab-separated with FORMAT TSVWithNames, or override the separator with settings like format_csv_delimiter when a downstream tool is fussy.
Because it is one SQL statement, a WHERE event_type = 'purchase', a GROUP BY, or a renamed column costs nothing extra. You filter and reshape during the conversion rather than cleaning up in a second pass.
Inference reads a sample of documents. If a field that should stay a string (a ZIP code, a leading-zero id) gets read as a number, pass an explicit structure: file('events.bson', 'BSONEachRow', 'event_id String, ...').
The conversion streams documents through rather than building the whole dataset in memory, so a BSON export bigger than your RAM still converts.
An events.bson.gz input is decompressed on the fly from the extension, and an events.csv.gz output is gzipped on the way out, with no separate step.

Reversing the conversion is the same idea in the other direction: SELECT from the CSV and write FORMAT BSONEachRow.

How fast is it?

On a 1,400,000-row BSON export (events_large.bson, ~145 MB), flattening the nested geo document and writing the full CSV completes in:

1clickhouse local -q "
2SELECT event_id, event_type, geo['city'] AS city, geo['country'] AS country, amount
3FROM file('events_large.bson')
4INTO OUTFILE 'events_large.csv' FORMAT CSVWithNames"

~1.64 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That parses every BSON document and writes 1.4M CSV rows from scratch; there is no cached table. An upload-required website would still be uploading.

Do it in Python with chDB

If you live in Python, chDB is the same ClickHouse engine in-process. The SQL is identical, so the conversion and the flatten work exactly as above:

1import chdb
2
3chdb.query("""
4SELECT
5    event_id,
6    event_type,
7    geo['city']    AS city,
8    geo['country'] AS country,
9    amount
10FROM file('events.bson')
11INTO OUTFILE 'events_flat.csv' FORMAT CSVWithNames
12""")

1"event_id","event_type","city","country","amount"
21,"signup","London","GB",5
32,"purchase","Berlin","DE",6.01
43,"login","Paris","FR",7.02

To keep the result in memory instead of on disk, ask chDB for a DataFrame: chdb.query(sql, "DataFrame") (see reading a BSON file in Python with chDB).

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh creates the sample BSON files (including the ~145 MB file used for the timing above), run.sh and run.py run every command on this page, and expected_output.txt is the captured output.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-bson-to-csv

The same SQL that converts a file on your laptop runs unchanged against a directory of files, a remote object store, or a ClickHouse server and ClickHouse Cloud when the data outgrows one machine. To query the BSON in place instead of converting it, see how to query a BSON file; to keep the data as documents, see convert BSON to JSON; and for the CSV side, how to run SQL on a CSV file.

The schema is inferred from the BSON

Flatten nested sub-documents into real columns

Options worth knowing

How fast is it?

Do it in Python with chDB

Run it yourself

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

Convert BSON to CSV

The schema is inferred from the BSON #

Flatten nested sub-documents into real columns #

Options worth knowing #

How fast is it? #

Do it in Python with chDB #

Run it yourself #

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

The schema is inferred from the BSON

Flatten nested sub-documents into real columns

Options worth knowing

How fast is it?

Do it in Python with chDB

Run it yourself