How to read a JSONL file in Python (faster than pandas)

Al Brown
Last updated: Jun 8, 2026

To read a JSONL file in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

1import chdb.datastore as pd
2df = pd.read_json("data/orders.jsonl", lines=True)

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

1pip install chdb

Prefer the command line? See how to run SQL on a JSONL file to do this with clickhouse local.

Read a JSONL file into a DataFrame #

pd.read_json with lines=True reads json-lines files the way you expect. Point it at a path and print the result:

1import chdb.datastore as pd
2
3df = pd.read_json("data/orders.jsonl", lines=True)
4print(df)
5print(df.dtypes)
1   order_id country status  amount       customer                   skus
20         1      GB   paid   49.99    [101, gold]         [sku-a, sku-b]
31         2      AU   cart    0.00  [102, silver]                [sku-c]
42         3      GB   paid   19.50    [103, gold]                [sku-a]
53         4      IN   paid    5.00  [104, bronze]  [sku-d, sku-e, sku-f]
64         5      AU   paid   99.00    [105, gold]                     []
7order_id      int64
8country         str
9status          str
10amount      float64
11customer     object
12skus         object
13dtype: object

Look at the dtypes. chDB inferred them from the data: order_id is an integer, amount is a float. Nested fields like customer and array fields like skus come through as object columns, ready to work with in pandas.

One thing is different from pandas under the hood: df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files.

Filter and aggregate the way you already do #

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

1import chdb.datastore as pd
2
3df = pd.read_json("data/orders.jsonl", lines=True)
4revenue = (df[df["status"] == "paid"]
5           .groupby("country")["amount"].sum()
6           .sort_values(ascending=False))
7print(revenue)
1country
2AU    99.00
3GB    69.49
4IN     5.00
5Name: amount, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches, rather than materializing the full file first.

Access nested fields and explode arrays #

When you need to work with the nested customer struct or the skus array, materialize the small file with .to_pandas() first and then use standard pandas:

1pdf = df.to_pandas()
2
3# customer comes back as [id, tier]; index 1 is the tier
4pdf["tier"] = pdf["customer"].apply(lambda c: c[1])
5print(pdf[["order_id", "country", "status", "tier"]].to_string(index=False))
1 order_id country status   tier
2        1      GB   paid   gold
3        2      AU   cart silver
4        3      GB   paid   gold
5        4      IN   paid bronze
6        5      AU   paid   gold

The skus column is an array per row. Use .explode() to flatten it into one row per SKU:

1paid = pdf[pdf["status"] == "paid"][["country", "skus"]].explode("skus")
2print(paid.dropna(subset=["skus"]).reset_index(drop=True))
1  country   skus
20      GB  sku-a
31      GB  sku-b
42      GB  sku-a
53      IN  sku-d
64      IN  sku-e
75      IN  sku-f

The pattern is: do the heavy filter and aggregation on the chDB object (fast, ClickHouse-backed), then .to_pandas() the result and continue in ordinary pandas for struct unpacking or explode operations.

Hand off to real pandas when you need it #

When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

1df = pd.read_json("data/orders.jsonl", lines=True)
2pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than pandas.read_json? #

On a file big enough that scanning is the actual cost, yes. pandas.read_json(lines=True) decodes every line into Python objects and builds a DataFrame before it can group anything, while chDB streams the parse and aggregation in compiled, multi-threaded code and only reads what you ask for. The aggregation above on a 2M-row JSONL file, best-of-3 with a warm page cache on an Apple M4 Pro (14 cores, 24 GB RAM, macOS); chDB 4.1.8, Python 3.14; best-of-3, warm:

1import pandas as pd:              1.555s
2import chdb.datastore as pd:      0.123s
3speedup:                          12.6x

About 12x here, from the same code with one import changed. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow, because plain pandas decodes and holds the full file in memory before it can group anything, while chDB streams and aggregates in one pass.

The small case can cut the other way: on a tiny JSONL (a few thousand rows) plain pandas is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large, or when you want one import to cover both.

Works in Jupyter #

Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself #

The complete, runnable example is here, with generate.sh to create the sample JSONL files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-jsonl

Working with other shapes of JSON? See read a JSON file in Python and read an NDJSON file in Python. They use the same import chdb.datastore as pd pattern. If you want to avoid re-parsing JSONL on every read, convert it to Parquet once and every later read gets faster.

The same code scales from a JSONL file on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.

Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...