How to flatten nested JSON in Python

To flatten nested JSON in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

1import chdb.datastore as pd
2df = pd.read_json("data/orders.json")
3flat = df.explode("items")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

1pip install chdb

The shape we're flattening

A common nested document has an outer object with a nested object field and an array-of-objects field:

1[
2  {"order_id": 1, "customer": {"id": 101, "country": "GB", "tier": "gold"},
3   "items": [{"sku": "A-1", "qty": 2, "price": 9.99}, {"sku": "B-2", "qty": 1, "price": 19.50}]},
4  {"order_id": 2, "customer": {"id": 102, "country": "AU", "tier": "silver"},
5   "items": [{"sku": "C-3", "qty": 5, "price": 4.00}]},
6  ...
7]

The goal is a flat table: one row per line item, with the parent fields (order_id, customer.country, customer.tier) repeated on each row.

Read a nested JSON file into a DataFrame

pd.read_json reads the file and returns a lazy, ClickHouse-backed DataFrame:

1import chdb.datastore as pd
2
3df = pd.read_json("data/orders.json")
4print(df)
5print(df.dtypes)

1order_id  ...                                              items
20         1  ...  [{'sku': 'A-1', 'qty': 2, 'price': 9.99}, {'sk...
31         2  ...           [{'sku': 'C-3', 'qty': 5, 'price': 4.0}]
42         3  ...  [{'sku': 'A-1', 'qty': 1, 'price': 9.99}, {'sk...
53         4  ...                                                 []
6
7[4 rows x 3 columns]
8order_id     int64
9customer    object
10items       object
11dtype: object

df is lazy and ClickHouse-backed: it records what you asked for and only computes when you materialize a result by printing, taking len(), or calling .to_pandas(). The customer and items columns have dtype object, which means nested dicts and lists: the structure you expect from JSON.

To see what is inside those columns, call .to_pandas() and inspect a row:

1pdf0 = df.to_pandas()
2print("customer[0]:", pdf0["customer"].iloc[0])
3print("items[0]:   ", pdf0["items"].iloc[0])

1customer[0]: {'id': 101, 'country': 'GB', 'tier': 'gold'}
2items[0]:    [{'sku': 'A-1', 'qty': 2, 'price': 9.99}, {'sku': 'B-2', 'qty': 1, 'price': 19.5}]

customer is a Python dict, and items is a Python list of dicts. That is the shape you work with.

Flatten the array column with `.explode()`

Call .explode("items") on the DataStore to unroll the array column. Each element becomes its own row, and the other columns (order_id, customer) repeat:

1flat = df.explode("items")
2print(flat.to_pandas()[["order_id", "items"]].to_string())

1order_id                                    items
20         1  {'sku': 'A-1', 'qty': 2, 'price': 9.99}
30         1  {'sku': 'B-2', 'qty': 1, 'price': 19.5}
41         2   {'sku': 'C-3', 'qty': 5, 'price': 4.0}
52         3  {'sku': 'A-1', 'qty': 1, 'price': 9.99}
62         3   {'sku': 'D-4', 'qty': 3, 'price': 2.5}
72         3  {'sku': 'E-5', 'qty': 1, 'price': 99.0}
83         4                                      NaN

Order 1 had two items: it becomes two rows. Order 3 had three items: it becomes three rows. Order 4 had an empty array: it becomes one row with NaN for items, following standard pandas .explode() behavior.

Extract nested fields and compute a derived column

After calling .to_pandas() on the exploded result, each items cell is a Python dict and each customer cell is a Python dict. Extract the fields you need with .apply():

1flat_pdf = flat.to_pandas().dropna(subset=["items"])
2
3flat_pdf["country"]    = flat_pdf["customer"].apply(lambda c: c["country"])
4flat_pdf["tier"]       = flat_pdf["customer"].apply(lambda c: c["tier"])
5flat_pdf["sku"]        = flat_pdf["items"].apply(lambda i: i["sku"])
6flat_pdf["qty"]        = flat_pdf["items"].apply(lambda i: i["qty"])
7flat_pdf["price"]      = flat_pdf["items"].apply(lambda i: i["price"])
8flat_pdf["line_total"] = (flat_pdf["qty"] * flat_pdf["price"]).round(2)
9
10result = flat_pdf[["order_id", "country", "tier", "sku", "qty", "line_total"]]
11print(result.sort_values(["order_id", "sku"]).to_string(index=False))

1order_id country   tier sku  qty  line_total
2        1      GB   gold A-1    2       19.98
3        1      GB   gold B-2    1       19.50
4        2      AU silver C-3    5       20.00
5        3      GB   gold A-1    1        9.99
6        3      GB   gold D-4    3        7.50
7        3      GB   gold E-5    1       99.00

That is the flattened table. dropna(subset=["items"]) removes the NaN row for the order with an empty items array. To keep those rows, skip the dropna and guard the lambda with if isinstance(i, dict) else None.

Aggregate the flattened rows

With the flat DataFrame in hand, standard pandas aggregation works as usual:

1revenue = flat_pdf.groupby("country")["line_total"].sum().round(2).sort_values(ascending=False)
2print(revenue)

1country
2GB    155.97
3AU     20.00
4Name: line_total, dtype: float64

What this replaces: pandas json_normalize

The pandas equivalent is json_normalize with a record_path for the array and a meta list for the parent fields you want to carry down:

1import json
2import pandas as pd
3
4with open("orders.json") as f:
5    records = json.load(f)
6
7flat_pdf = pd.json_normalize(
8    records,
9    record_path="items",
10    meta=["order_id", ["customer", "country"], ["customer", "tier"]],
11)
12flat_pdf["line_total"] = (flat_pdf["qty"] * flat_pdf["price"]).round(2)
13flat_pdf = flat_pdf.rename(columns={"customer.country": "country", "customer.tier": "tier"})
14print(flat_pdf[["order_id", "country", "tier", "sku", "qty", "line_total"]]
15      .sort_values(["order_id", "sku"]).to_string(index=False))

Same result. The work is in the setup: you name the record path, hand-write each parent field as a path list, then rename the dotted column names that json_normalize produces. The chDB approach states the flatten as .explode("items") plus dict-key access, which maps to the pandas operations you already know.

Hand off to real pandas when you need it

When a library needs an actual pandas.DataFrame, call .to_pandas() to materialize one:

1df = pd.read_json("orders.json")
2pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy read and explode on the chDB object, where it is fast, then .to_pandas() the result and continue with ordinary pandas for the field-extraction step.

Is it faster than pandas?

On a 163 MB JSON file (800,000 orders, each with a nested customer object and an array of line items), running the full flatten-and-aggregate pipeline (chDB read_json + explode + apply field extraction + groupby vs pandas json_normalize + groupby), best-of-3 with a warm page cache on an Apple M4 Pro (14 cores, 24 GB RAM, macOS); chDB 4.1.8, Python 3.14:

1import pandas as pd (json_normalize):      7.980s
2import chdb.datastore as pd (explode):     2.150s
3speedup:                                   3.7x

About 3.7x here, from the same code with one import changed. Absolute times shift with cache state and concurrent load; the back-to-back ratio is the robust claim. The gap comes from the read and explode step: chDB parses the JSON file and unrolls the array in compiled ClickHouse code, then .to_pandas() hands a smaller result to Python for the dict field extraction. The pandas path parses the whole file in Python first, which dominates the total time.

On a small file (a few hundred records) plain pandas is likely as fast or faster, because chDB pays a fixed per-query startup cost. Reach for chDB when the file is large enough that parsing is the bottleneck.

Works in Jupyter

Printing a chDB object renders it as a table in a notebook. After .to_pandas() the result feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample JSON files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-flatten-nested-json

Working with JSON in other formats? Read a JSON file in Python and read an NDJSON file in Python use the same import chdb.datastore as pd pattern. New to the line-delimited format? See what is NDJSON. If you keep re-reading the same nested JSON, convert it to Parquet once and every later read gets faster.

The same code scales from a file on your laptop to a server to ClickHouse Cloud with no rewrite.

The shape we're flattening

Read a nested JSON file into a DataFrame

Flatten the array column with `.explode()`

Extract nested fields and compute a derived column

Aggregate the flattened rows

What this replaces: pandas json_normalize

Hand off to real pandas when you need it

Is it faster than pandas?

Works in Jupyter

Run it yourself

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

How to flatten nested JSON in Python

The shape we're flattening #

Read a nested JSON file into a DataFrame #

Flatten the array column with .explode() #

Extract nested fields and compute a derived column #

Aggregate the flattened rows #

What this replaces: pandas json_normalize #

Hand off to real pandas when you need it #

Is it faster than pandas? #

Works in Jupyter #

Run it yourself #

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

The shape we're flattening

Read a nested JSON file into a DataFrame

Flatten the array column with `.explode()`

Extract nested fields and compute a derived column

Aggregate the flattened rows

What this replaces: pandas json_normalize

Hand off to real pandas when you need it

Is it faster than pandas?

Works in Jupyter

Run it yourself