To read an NDJSON file in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:
    import chdb.datastore as pd
    df = pd.read_json("data/events.ndjson", lines=True)

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

    pip install chdb

Prefer the command line? See how to query an NDJSON file to do this with clickhouse local.
Read an NDJSON file into a DataFrame
pd.read_json with lines=True reads one JSON object per line and infers each column's type:
    import chdb.datastore as pd

    df = pd.read_json("data/events.ndjson", lines=True)
    print(df)
    print(df.dtypes)

Output:

       event_id country event_type  revenue           user               tags
    0         1      GB   purchase    49.99    [101, gold]       [web, promo]
    1         2      AU       view     0.00  [102, silver]              [app]
    2         3      GB   purchase    19.50    [103, gold]              [web]
    3         4      IN   purchase     5.00  [104, bronze]  [app, promo, ref]
    4         5      AU   purchase    99.00    [105, gold]                 []
    event_id        int64
    country           str
    event_type        str
    revenue       float64
    user           object
    tags           object
    dtype: object

Look at the dtypes. chDB inferred them from the data: event_id is an integer and revenue is a float. You did not declare a single type. Nested columns (user, tags) come back as object: Python lists or dicts, exactly as they appeared in the file.
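To make the format concrete, here is a minimal standard-library sketch of what "one JSON object per line" means and how column types could be guessed from the values. It only illustrates the idea and is not how chDB parses or infers types. The rows are copied from the printed sample above, the exact file layout is assumed, and read_ndjson and infer_column_types are hypothetical helpers.

```python
import io
import json

# Three sample rows in NDJSON form: one complete JSON object per line.
# Values are copied from the printed output above; the file layout is assumed.
NDJSON_TEXT = """\
{"event_id": 1, "country": "GB", "event_type": "purchase", "revenue": 49.99, "tags": ["web", "promo"]}
{"event_id": 2, "country": "AU", "event_type": "view", "revenue": 0.0, "tags": ["app"]}
{"event_id": 3, "country": "GB", "event_type": "purchase", "revenue": 19.5, "tags": ["web"]}
"""


def read_ndjson(stream):
    """Parse each non-empty line as an independent JSON document."""
    return [json.loads(line) for line in stream if line.strip()]


def infer_column_types(rows):
    """Hypothetical helper: collect the Python type names seen per column."""
    seen = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(names) for key, names in seen.items()}


rows = read_ndjson(io.StringIO(NDJSON_TEXT))
column_types = infer_column_types(rows)
print(column_types)
```

Because every line is a complete document, a reader can process the file line by line and never needs the whole file to be one valid JSON value.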
df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.
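The record-now, run-later idea can be sketched in a few lines of plain Python. LazyRows below is a toy class invented for this illustration. It is not chDB's implementation and shares no code with it; it only shows that building a chain does no work until something asks for the result.

```python
class LazyRows:
    """Toy stand-in for a lazy frame: records steps, runs them on demand."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)
        self.runs = 0  # how many times this pipeline actually executed

    def where(self, predicate):
        # Record the filter; nothing is evaluated yet.
        return LazyRows(self._rows, self._steps + (("where", predicate),))

    def select(self, *columns):
        # Record the projection; nothing is evaluated yet.
        return LazyRows(self._rows, self._steps + (("select", columns),))

    def collect(self):
        """Materialize: apply every recorded step in order."""
        self.runs += 1
        result = self._rows
        for kind, arg in self._steps:
            if kind == "where":
                result = [row for row in result if arg(row)]
            else:  # "select"
                result = [{name: row[name] for name in arg} for row in result]
        return list(result)


events = [
    {"event_id": 1, "event_type": "purchase", "revenue": 49.99},
    {"event_id": 2, "event_type": "view", "revenue": 0.0},
]

plan = LazyRows(events).where(lambda r: r["event_type"] == "purchase").select("event_id")
runs_before = plan.runs   # still 0: only the plan exists so far
result = plan.collect()   # the recorded steps run here
print(runs_before, plan.runs, result)
```

In this toy, collect() plays the role that printing, len(), or .to_pandas() play in the text above.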
Filter and aggregate the way you already do
The pandas you write does not change. Filter with a boolean mask, group, and aggregate:
    import chdb.datastore as pd

    df = pd.read_json("data/events.ndjson", lines=True)
    purchases = df[df["event_type"] == "purchase"].groupby("country")["revenue"].sum().sort_values(ascending=False)
    print(purchases)

Output:

    country
    AU    99.00
    GB    69.49
    IN     5.00
    Name: revenue, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than materializing the full file first.
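If you want to cross-check the totals by hand, the same filter, group, sum, and sort can be written in plain Python. This is only a verification sketch over the five sample rows printed earlier, not a description of what chDB executes.

```python
from collections import defaultdict

# Rows copied from the printed sample above (only the columns used here).
events = [
    {"country": "GB", "event_type": "purchase", "revenue": 49.99},
    {"country": "AU", "event_type": "view", "revenue": 0.00},
    {"country": "GB", "event_type": "purchase", "revenue": 19.50},
    {"country": "IN", "event_type": "purchase", "revenue": 5.00},
    {"country": "AU", "event_type": "purchase", "revenue": 99.00},
]

totals = defaultdict(float)
for event in events:
    # Equivalent of the boolean mask: keep purchases only.
    if event["event_type"] == "purchase":
        # Equivalent of groupby("country")["revenue"].sum().
        totals[event["country"]] += event["revenue"]

# Equivalent of sort_values(ascending=False): largest total first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for country, revenue in ranked:
    print(f"{country} {revenue:.2f}")
```

The AU view event contributes nothing because the filter drops it before the sum, which is why AU totals 99.00.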
Access nested fields
NDJSON rows often carry nested objects and arrays. chDB reads them into object columns. To extract a sub-field, materialize the column with .to_pandas() and use .apply():
    import chdb.datastore as pd

    df = pd.read_json("data/events.ndjson", lines=True)
    # Materialize the flat columns as a real pandas DataFrame.
    user_tiers = df[["event_id", "revenue"]].copy().to_pandas()
    # user may arrive as a dict or as a positional sequence; handle both.
    user_tiers["tier"] = df["user"].to_pandas().apply(lambda u: u["tier"] if isinstance(u, dict) else u[1])
    gold = user_tiers[user_tiers["tier"] == "gold"]
    print(gold)

Output:

       event_id  revenue  tier
    0         1    49.99  gold
    2         3    19.50  gold
    4         5    99.00  gold

The pattern is: do the filtering and aggregation on the chDB object (where it is fast), then .to_pandas() the relevant column when you need to reach into nested data.
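When the extraction logic grows beyond a one-line lambda, a named function is easier to test. The sketch below is a hypothetical helper, extract_tier, that covers the same two shapes the lambda handles and returns None for anything else. The "id" key in the sample dict is an assumption; only "tier" appears in the code above.

```python
def extract_tier(user):
    """Hypothetical helper: return the tier from a nested user value.

    Assumes the two shapes handled above: a dict with a "tier" key, or a
    sequence whose second element is the tier. Anything else yields None.
    """
    if isinstance(user, dict):
        return user.get("tier")
    if isinstance(user, (list, tuple)) and len(user) > 1:
        return user[1]
    return None


# Sample values; the "id" key is assumed for illustration.
samples = [{"id": 101, "tier": "gold"}, [102, "silver"], (103, "gold"), None, [104]]
tiers = [extract_tier(u) for u in samples]
print(tiers)
```

A function like this can be passed to .apply() on the materialized pandas column in place of the lambda.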
Hand off to real pandas when you need it
When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:
    import chdb.datastore as pd

    df = pd.read_json("data/events.ndjson", lines=True)
    pdf = df.to_pandas()  # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.
Is it faster than pandas.read_json?
On a file big enough that scanning is the actual cost, yes. NDJSON is a text format, and chDB parses and aggregates in compiled, multi-threaded ClickHouse code rather than the Python layer. The filter and aggregation above were timed on a 2M-row NDJSON file, best-of-3 with a warm page cache, on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14:
    import pandas as pd: 0.174s
    import chdb.datastore as pd: 0.031s
    speedup: 5.5x

About 5.5x here, from the same code with one import changed. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load.
The small case can cut the other way: on a tiny NDJSON file (a few thousand rows) plain pandas is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large, or when you want one import to cover both.
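To check where the crossover falls on your own files, a best-of-N timer is enough. The sketch below is an assumed harness, not the one used for the numbers above. best_of is a hypothetical helper, and workload is a placeholder to replace with the read_json and groupby chain under each import.

```python
import time


def best_of(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def workload():
    # Placeholder work; swap in the read_json + groupby chain you want to time.
    return sum(i * i for i in range(100_000))


baseline = best_of(workload)
candidate = best_of(workload)
print(f"baseline:  {baseline:.4f}s")
print(f"candidate: {candidate:.4f}s")
print(f"ratio:     {baseline / candidate:.2f}x")
```

Taking the minimum rather than the mean reduces the influence of one-off stalls, and running both variants back to back keeps cache state comparable.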
Works in Jupyter
Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.
Run it yourself
The complete, runnable example is here, with generate.sh to create the sample NDJSON files plus run.ipynb and a run.py mirror containing the exact code above:
github.com/ClickHouse/examples/tree/main/local-analytics/chdb-ndjson
Working with other formats? The guides "Read a JSONL file in Python" and "Read a JSON file in Python" use the same drop-in import chdb.datastore as pd pattern; only the reader changes. For deeply nested documents, see "Flatten nested JSON in Python".
The same code scales from an NDJSON file on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.