How to read a JSON file in Python (faster than pandas)

Al Brown
Last updated: Jun 8, 2026

To read a JSON file in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

1import chdb.datastore as pd
2df = pd.read_json("data/events.ndjson", lines=True)   # NDJSON: one object per line

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

1pip install chdb

Prefer the command line? See how to run SQL on a JSON file to do this with clickhouse local.

Read an NDJSON file into a DataFrame #

Most "JSON" data you query in bulk is NDJSON (newline-delimited JSON, also called json-lines): one JSON object per line. Pass lines=True and chDB reads it directly, inferring column types from the data:

1import chdb.datastore as pd
2
3df = pd.read_json("data/events.ndjson", lines=True)
4print(df)
5print(df.dtypes)
1   event_id country event_type  revenue           user                  items
20         1      GB   purchase    49.99    [101, gold]         [sku-a, sku-b]
31         2      AU       view     0.00  [102, silver]                [sku-c]
42         3      GB   purchase    19.50    [103, gold]                [sku-a]
53         4      IN   purchase     5.00  [104, bronze]  [sku-d, sku-e, sku-f]
64         5      AU   purchase    99.00    [105, gold]                     []
7event_id        int64
8country           str
9event_type        str
10revenue       float64
11user           object
12items          object
13dtype: object

chDB inferred the types automatically: event_id is an integer, revenue is a float. Nested objects (user) and arrays (items) survive as object columns. You did not declare a single dtype.

One thing is different from pandas under the hood: df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result: print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.

A top-level JSON array (a .json file whose outer structure is [{...}, {...}]) reads without lines=True:

1arr_df = pd.read_json("data/events_array.json")

Filter and aggregate the way you already do #

The pandas you write does not change. Use a boolean mask to filter, then groupby and aggregate:

1import chdb.datastore as pd
2
3df = pd.read_json("data/events.ndjson", lines=True)
4purchases = (df[df["event_type"] == "purchase"]
5             .groupby("country")["revenue"].sum()
6             .sort_values(ascending=False))
7print(purchases)
1country
2AU    99.00
3GB    69.49
4IN     5.00
5Name: revenue, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than materializing the full file first.

Nested objects: extract fields with .apply #

When chDB reads an NDJSON file with nested objects, those objects come back as struct columns (arrays of values in field order). After calling .to_pandas(), use .apply to pull out the fields you need:

1import pandas as real_pd
2
3pdf = df.to_pandas()
4pdf["user_id"] = pdf["user"].apply(lambda x: x[0])
5pdf["user_tier"] = pdf["user"].apply(lambda x: x[1])
6print(pdf[["event_id", "user_id", "user_tier", "revenue"]])
1   event_id  user_id user_tier  revenue
20         1      101      gold    49.99
31         2      102    silver     0.00
42         3      103      gold    19.50
53         4      104    bronze     5.00
64         5      105      gold    99.00

Nested arrays: flatten with .explode #

Arrays come back as list columns. Call .explode() on a real pandas DataFrame to turn each element into its own row:

1pdf = df.to_pandas()
2exploded = (pdf[["event_id", "items"]]
3            .explode("items")
4            .dropna(subset=["items"])
5            .reset_index(drop=True))
6print(exploded)
1   event_id  items
20         1  sku-a
31         1  sku-b
42         2  sku-c
53         3  sku-a
64         4  sku-d
75         4  sku-e
86         4  sku-f

Hand off to real pandas when you need it #

When a library downstream needs an actual pandas object (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

1result = df[df["event_type"] == "purchase"].groupby("country")["revenue"].sum()
2pdf_result = result.to_pandas()   # a real pandas.Series, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than pandas.read_json? #

On a file big enough that scanning is the actual cost, yes. JSON is a text format, so most of the work is parsing bytes into typed values, and chDB does that in compiled, multi-threaded ClickHouse code rather than the Python layer. The same filter-and-aggregate above on a 2M-row (~146 MB) NDJSON file, best-of-3 with a warm page cache on an Apple M4 Pro (14 cores, 24 GB RAM, macOS); chDB 4.1.8, Python 3.14; best-of-3, warm:

1import pandas as pd:            1.714s
2import chdb.datastore as pd:    0.134s
3speedup:                        12.8x

About 13x here, from the same code with one import changed. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow, because plain pandas parses and holds the whole file in memory before it can group anything, while chDB streams and aggregates in one pass.

The small case can cut the other way: on a tiny JSON file (a few thousand rows) plain pandas is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large, or when you want one import to cover both.

Works in Jupyter #

Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself #

The complete, runnable example is here, with generate.sh to create the sample JSON files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-json

Working with other formats? Read a CSV file in Python and read a Parquet file in Python use the same drop-in import chdb.datastore as pd pattern; only the reader changes. If you keep re-reading the same JSON, convert it to Parquet once and every later read gets faster.

The same code scales from a JSON file on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.

Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...