How to flatten nested JSON in Python

Al Brown
Last updated: Jun 8, 2026

To flatten nested JSON in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

import chdb.datastore as pd
df = pd.read_json("data/orders.json")
flat = df.explode("items")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

pip install chdb

The shape we're flattening

A common nested document has an outer object with a nested object field and an array-of-objects field:

[
  {"order_id": 1, "customer": {"id": 101, "country": "GB", "tier": "gold"},
   "items": [{"sku": "A-1", "qty": 2, "price": 9.99}, {"sku": "B-2", "qty": 1, "price": 19.50}]},
  {"order_id": 2, "customer": {"id": 102, "country": "AU", "tier": "silver"},
   "items": [{"sku": "C-3", "qty": 5, "price": 4.00}]},
  ...
]

The goal is a flat table: one row per line item, with the parent fields (order_id, customer.country, customer.tier) repeated on each row.

Read a nested JSON file into a DataFrame

pd.read_json reads the file and returns a lazy, ClickHouse-backed DataFrame:

import chdb.datastore as pd

df = pd.read_json("data/orders.json")
print(df)
print(df.dtypes)
order_id  ...                                              items
0         1  ...  [{'sku': 'A-1', 'qty': 2, 'price': 9.99}, {'sk...
1         2  ...           [{'sku': 'C-3', 'qty': 5, 'price': 4.0}]
2         3  ...  [{'sku': 'A-1', 'qty': 1, 'price': 9.99}, {'sk...
3         4  ...                                                 []

[4 rows x 3 columns]
order_id     int64
customer    object
items       object
dtype: object

df is lazy and ClickHouse-backed: it records what you asked for and only computes when you materialize a result by printing, taking len(), or calling .to_pandas(). The customer and items columns have dtype object, which means nested dicts and lists: the structure you expect from JSON.

To see what is inside those columns, call .to_pandas() and inspect a row:

pdf0 = df.to_pandas()
print("customer[0]:", pdf0["customer"].iloc[0])
print("items[0]:   ", pdf0["items"].iloc[0])
customer[0]: {'id': 101, 'country': 'GB', 'tier': 'gold'}
items[0]:    [{'sku': 'A-1', 'qty': 2, 'price': 9.99}, {'sku': 'B-2', 'qty': 1, 'price': 19.5}]

customer is a Python dict, and items is a Python list of dicts. That is the shape you work with.

Flatten the array column with .explode()

Call .explode("items") on the DataStore to unroll the array column. Each element becomes its own row, and the other columns (order_id, customer) repeat:

flat = df.explode("items")
print(flat.to_pandas()[["order_id", "items"]].to_string())
order_id                                    items
0         1  {'sku': 'A-1', 'qty': 2, 'price': 9.99}
0         1  {'sku': 'B-2', 'qty': 1, 'price': 19.5}
1         2   {'sku': 'C-3', 'qty': 5, 'price': 4.0}
2         3  {'sku': 'A-1', 'qty': 1, 'price': 9.99}
2         3   {'sku': 'D-4', 'qty': 3, 'price': 2.5}
2         3  {'sku': 'E-5', 'qty': 1, 'price': 99.0}
3         4                                      NaN

Order 1 had two items: it becomes two rows. Order 3 had three items: it becomes three rows. Order 4 had an empty array: it becomes one row with NaN for items, following standard pandas .explode() behavior.

Extract nested fields and compute a derived column

After calling .to_pandas() on the exploded result, each items cell is a Python dict and each customer cell is a Python dict. Extract the fields you need with .apply():

flat_pdf = flat.to_pandas().dropna(subset=["items"])

flat_pdf["country"]    = flat_pdf["customer"].apply(lambda c: c["country"])
flat_pdf["tier"]       = flat_pdf["customer"].apply(lambda c: c["tier"])
flat_pdf["sku"]        = flat_pdf["items"].apply(lambda i: i["sku"])
flat_pdf["qty"]        = flat_pdf["items"].apply(lambda i: i["qty"])
flat_pdf["price"]      = flat_pdf["items"].apply(lambda i: i["price"])
flat_pdf["line_total"] = (flat_pdf["qty"] * flat_pdf["price"]).round(2)

result = flat_pdf[["order_id", "country", "tier", "sku", "qty", "line_total"]]
print(result.sort_values(["order_id", "sku"]).to_string(index=False))
order_id country   tier sku  qty  line_total
        1      GB   gold A-1    2       19.98
        1      GB   gold B-2    1       19.50
        2      AU silver C-3    5       20.00
        3      GB   gold A-1    1        9.99
        3      GB   gold D-4    3        7.50
        3      GB   gold E-5    1       99.00

That is the flattened table. dropna(subset=["items"]) removes the NaN row for the order with an empty items array. To keep those rows, skip the dropna and guard the lambda with if isinstance(i, dict) else None.

Aggregate the flattened rows

With the flat DataFrame in hand, standard pandas aggregation works as usual:

revenue = flat_pdf.groupby("country")["line_total"].sum().round(2).sort_values(ascending=False)
print(revenue)
country
GB    155.97
AU     20.00
Name: line_total, dtype: float64

What this replaces: pandas json_normalize

The pandas equivalent is json_normalize with a record_path for the array and a meta list for the parent fields you want to carry down:

import json
import pandas as pd

with open("orders.json") as f:
    records = json.load(f)

flat_pdf = pd.json_normalize(
    records,
    record_path="items",
    meta=["order_id", ["customer", "country"], ["customer", "tier"]],
)
flat_pdf["line_total"] = (flat_pdf["qty"] * flat_pdf["price"]).round(2)
flat_pdf = flat_pdf.rename(columns={"customer.country": "country", "customer.tier": "tier"})
print(flat_pdf[["order_id", "country", "tier", "sku", "qty", "line_total"]]
      .sort_values(["order_id", "sku"]).to_string(index=False))

Same result. The work is in the setup: you name the record path, hand-write each parent field as a path list, then rename the dotted column names that json_normalize produces. The chDB approach states the flatten as .explode("items") plus dict-key access, which maps to the pandas operations you already know.

Hand off to real pandas when you need it

When a library needs an actual pandas.DataFrame, call .to_pandas() to materialize one:

df = pd.read_json("orders.json")
pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy read and explode on the chDB object, where it is fast, then .to_pandas() the result and continue with ordinary pandas for the field-extraction step.

Is it faster than pandas?

On a 163 MB JSON file (800,000 orders, each with a nested customer object and an array of line items), running the full flatten-and-aggregate pipeline (chDB read_json + explode + apply field extraction + groupby vs pandas json_normalize + groupby), best-of-3 with a warm page cache on an Apple M4 Pro (14 cores, 24 GB RAM, macOS); chDB 4.1.8, Python 3.14:

import pandas as pd (json_normalize):      7.980s
import chdb.datastore as pd (explode):     2.150s
speedup:                                   3.7x

About 3.7x here, from the same code with one import changed. Absolute times shift with cache state and concurrent load; the back-to-back ratio is the robust claim. The gap comes from the read and explode step: chDB parses the JSON file and unrolls the array in compiled ClickHouse code, then .to_pandas() hands a smaller result to Python for the dict field extraction. The pandas path parses the whole file in Python first, which dominates the total time.

On a small file (a few hundred records) plain pandas is likely as fast or faster, because chDB pays a fixed per-query startup cost. Reach for chDB when the file is large enough that parsing is the bottleneck.

Works in Jupyter

Printing a chDB object renders it as a table in a notebook. After .to_pandas() the result feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample JSON files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-flatten-nested-json

Working with JSON in other formats? Read a JSON file in Python and read an NDJSON file in Python use the same import chdb.datastore as pd pattern. New to the line-delimited format? See what is NDJSON. If you keep re-reading the same nested JSON, convert it to Parquet once and every later read gets faster.

The same code scales from a file on your laptop to a server to ClickHouse Cloud with no rewrite.


Share this resource

  • Y Combinator icon
  • X icon
  • Bluesky icon
  • Facebook icon
  • LinkedIn icon

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!

More like this

Scale vector search in Postgres with pgvector: avoid HNSW RAM limits, fix filtering recall, and know when to go hybrid. Read now.

Continue reading ->

How to query a REST API in Python

Al Brown • Last updated: Jun 15, 2026

Read a JSON API response into a DataFrame with chDB. Use the pandas API you already know to filter and aggregate the response, running on ClickHouse's engine with no server to start.

Continue reading ->

How to convert Parquet to ORC

Al Brown • Last updated: Jun 6, 2026

Convert a Parquet file to ORC with one clickhouse-local command. The schema is read from the Parquet footer and the types carry into ORC, with no server and no upload.

Continue reading ->