How to query a REST API in Python

Al Brown
Last updated: Jun 15, 2026

To query a REST API in Python, use chDB to read the JSON response directly into a DataFrame:

1from chdb.datastore import DataStore
2
3df = DataStore.from_url("https://api.example.com/orders", format="JSONEachRow")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

1pip install chdb

Prefer the command line? See how to query a REST API with SQL to do this with clickhouse local.

Read a JSON API response into a DataFrame #

The companion example serves sample data over Python's built-in http.server. That stand-in for a real API shows the exact call you would make against any live JSON endpoint.

1from chdb.datastore import DataStore
2
3df = DataStore.from_url("http://127.0.0.1:8731/orders.json", format="JSONEachRow")
4print(df)
5print(df.dtypes)
1   id country   status  amount          labels
20   0      GB     open   10.99    [emea, gold]
31   1      AU   closed   12.36  [apac, silver]
42   2      IN  pending   13.73  [amer, bronze]
53   3      US     open   15.10    [emea, gold]
64   4      DE   closed   16.47  [apac, silver]
7id           int64
8country        str
9status         str
10amount     float64
11labels      object
12dtype: object

chDB infers the schema from the response: id comes back as an integer, amount as a float. The nested labels field comes back as an object column (a numpy array per row when you materialize). You declared nothing.

One thing is different from a regular DataFrame: df is not the response loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large responses.

Filter and aggregate the way you already do #

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

1from chdb.datastore import DataStore
2
3df = DataStore.from_url("http://127.0.0.1:8731/orders.json", format="JSONEachRow")
4revenue = (
5    df[df["status"] != "closed"]
6    .groupby("country")["amount"]
7    .sum()
8    .sort_values(ascending=False)
9)
10print(revenue.to_pandas())
1country
2US    15.10
3IN    13.73
4GB    10.99
5Name: amount, dtype: float64

Same syntax as pandas, same result. chDB compiles the whole chain into one optimized query and only materializes what you asked for, rather than pulling the full response into memory first. Joins with merge, computed columns with assign, and the .str and .dt accessors all work the same way.

Hand off to real pandas when you need it #

When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

1df = DataStore.from_url("http://127.0.0.1:8731/orders.json", format="JSONEachRow")
2pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object first, then .to_pandas() the small result and continue in ordinary pandas.

Multiple pages: read each URL, then concat #

When an API paginates across predictable URLs, read each page into a DataFrame and concat them:

1import pandas as pd
2from chdb.datastore import DataStore
3
4page1 = DataStore.from_url("http://127.0.0.1:8731/orders_page1.json", format="JSONEachRow").to_pandas()
5page2 = DataStore.from_url("http://127.0.0.1:8731/orders_page2.json", format="JSONEachRow").to_pandas()
6pages = pd.concat([page1, page2], ignore_index=True)
7print(pages.groupby("country")["amount"].sum().sort_values(ascending=False))
1country
2IN    1.82
3US    1.06
4AU    0.91
5DE    0.53
6GB    0.00
7Name: amount, dtype: float64

Is it faster than requests + json? #

On a large response, yes. The same filter-and-aggregate over a 2M-row (~120 MB) JSON endpoint, served on localhost so network latency is out of the picture, best-of-3 with a warm cache on an Apple M4 Pro (14 cores, 24 GB RAM, macOS):

1requests + json + manual agg:   2.067s
2DataStore.from_url (chDB):      0.162s
3speedup:                        12.8x

About 12-13x faster here. The gap is fetch-plus-parse-plus-aggregate cost: chDB streams the response and aggregates as it goes, while the Python loop decodes every line into objects before it can do anything with them. Against a real API the network round trip adds equally to both sides, so it does not change the ratio; it just raises the floor.

The back-to-back ratio is the robust claim; absolute times shift with concurrent load. On a tiny response the comparison can flip, because chDB pays a small fixed per-query cost that dominates when there is barely any data. Reach for DataStore.from_url when the payload is large enough that parsing and aggregating it is the real work.

Hardware: Apple M4 Pro (14 cores, 24 GB RAM, macOS); chDB 4.1.8, Python 3.14; best-of-3, warm.

Works in Jupyter #

Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself #

The complete, runnable example is here, with generate.sh to create the sample JSON files, plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-rest-api

Working with JSON from other sources? Read a JSON file in Python and read a JSON Lines file in Python use the same pandas drop-in pattern; only the reader changes. The same code scales from a local script to ClickHouse Cloud with no rewrite.

Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...