Convert BSON to CSV

Al Brown
Last updated: Jun 15, 2026

To convert BSON to CSV, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the file and write CSV out:

clickhouse local -q "SELECT * FROM file('events.bson') INTO OUTFILE 'events.csv' FORMAT CSVWithNames"
"event_id","event_type","geo","amount"
1,"signup","{'city':'London','country':'GB'}",5
2,"purchase","{'city':'Berlin','country':'DE'}",6.01
3,"login","{'city':'Paris','country':'FR'}",7.02

ClickHouse reads the BSON documents in place, infers the schema from the field names and types each document carries, and streams CSV out without an import step or an upload. Files larger than RAM convert the same way.

That works, but look at the geo column. It is a whole sub-document crammed into one CSV cell. BSON nests; CSV is flat. The interesting part of this conversion is what happens to nested fields, so let's handle that properly.

The schema is inferred from the BSON

BSON (Binary JSON) is the format MongoDB stores and mongoexport emits. Each document carries its own field names and types, so clickhouse-local can read the structure without you describing it. Check what it found with DESCRIBE:

clickhouse local -q "DESCRIBE file('events.bson')"
event_id	Nullable(Int64)
event_type	Nullable(String)
geo	Map(String, Nullable(String))
amount	Nullable(Float64)

Scalar fields map cleanly: an integer becomes Int64, a string becomes String, a double becomes Float64. The nested geo sub-document becomes a Map(String, ...). That is correct and lossless inside ClickHouse, but CSV has no concept of a map. When you write it out, the whole map is serialized into a single quoted text cell, as you saw above.

Flatten nested sub-documents into real columns

A flat CSV with one value per column is almost always what you want downstream. Pull the keys you care about out of the map and project them as their own columns in the SELECT:

clickhouse local -q "
SELECT
  event_id,
  event_type,
  geo['city']    AS city,
  geo['country'] AS country,
  amount
FROM file('events.bson')
INTO OUTFILE 'events_flat.csv' FORMAT CSVWithNames"
"event_id","event_type","city","country","amount"
1,"signup","London","GB",5
2,"purchase","Berlin","DE",6.01
3,"login","Paris","FR",7.02
4,"refund","Austin","US",8.03
5,"signup","London","GB",9.04

city and country are now first-class columns. The flat file reads back with a clean, fully typed schema, ready for any tool that speaks CSV:

event_id	Nullable(Int64)
event_type	Nullable(String)
city	Nullable(String)
country	Nullable(String)
amount	Nullable(Float64)

The same approach handles arrays inside a document: index them (tags[1]), join them into a string (arrayStringConcat(tags, '|')), or arrayJoin them to fan one document out into multiple CSV rows. You decide how the nesting collapses, instead of getting a serialized blob you have to re-parse later.

Options worth knowing

This is where a scriptable converter beats a drag-and-drop website. You control the output, not a fixed template.

  • FORMAT CSVWithNames gives comma-separated output with a header row. Drop the header with FORMAT CSV, switch to tab-separated with FORMAT TSVWithNames, or override the separator with settings like format_csv_delimiter when a downstream tool is fussy.
  • Because it is one SQL statement, a WHERE event_type = 'purchase', a GROUP BY, or a renamed column costs nothing extra. You filter and reshape during the conversion rather than cleaning up in a second pass.
  • Inference reads a sample of documents. If a field that should stay a string (a ZIP code, a leading-zero id) gets read as a number, pass an explicit structure: file('events.bson', 'BSONEachRow', 'event_id String, ...').
  • The conversion streams documents through rather than building the whole dataset in memory, so a BSON export bigger than your RAM still converts.
  • An events.bson.gz input is decompressed on the fly from the extension, and an events.csv.gz output is gzipped on the way out, with no separate step.

Reversing the conversion is the same idea in the other direction: SELECT from the CSV and write FORMAT BSONEachRow.

How fast is it?

On a 1,400,000-row BSON export (events_large.bson, ~145 MB), flattening the nested geo document and writing the full CSV completes in:

clickhouse local -q "
SELECT event_id, event_type, geo['city'] AS city, geo['country'] AS country, amount
FROM file('events_large.bson')
INTO OUTFILE 'events_large.csv' FORMAT CSVWithNames"

~1.64 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That parses every BSON document and writes 1.4M CSV rows from scratch; there is no cached table. An upload-required website would still be uploading.

Do it in Python with chDB

If you live in Python, chDB is the same ClickHouse engine in-process. The SQL is identical, so the conversion and the flatten work exactly as above:

import chdb

chdb.query("""
SELECT
    event_id,
    event_type,
    geo['city']    AS city,
    geo['country'] AS country,
    amount
FROM file('events.bson')
INTO OUTFILE 'events_flat.csv' FORMAT CSVWithNames
""")
"event_id","event_type","city","country","amount"
1,"signup","London","GB",5
2,"purchase","Berlin","DE",6.01
3,"login","Paris","FR",7.02

To keep the result in memory instead of on disk, ask chDB for a DataFrame: chdb.query(sql, "DataFrame") (see reading a BSON file in Python with chDB).

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh creates the sample BSON files (including the ~145 MB file used for the timing above), run.sh and run.py run every command on this page, and expected_output.txt is the captured output.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-bson-to-csv

The same SQL that converts a file on your laptop runs unchanged against a directory of files, a remote object store, or a ClickHouse server and ClickHouse Cloud when the data outgrows one machine. To query the BSON in place instead of converting it, see how to query a BSON file; to keep the data as documents, see convert BSON to JSON; and for the CSV side, how to run SQL on a CSV file.


Share this resource

  • Y Combinator icon
  • X icon
  • Bluesky icon
  • Facebook icon
  • LinkedIn icon

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!

More like this

Scale vector search in Postgres with pgvector: avoid HNSW RAM limits, fix filtering recall, and know when to go hybrid. Read now.

Continue reading ->

How to query a REST API in Python

Al Brown • Last updated: Jun 15, 2026

Read a JSON API response into a DataFrame with chDB. Use the pandas API you already know to filter and aggregate the response, running on ClickHouse's engine with no server to start.

Continue reading ->

How to convert Parquet to ORC

Al Brown • Last updated: Jun 6, 2026

Convert a Parquet file to ORC with one clickhouse-local command. The schema is read from the Parquet footer and the types carry into ORC, with no server and no upload.

Continue reading ->