What is chDB? In-process ClickHouse for Python

Al Brown
Last updated: Jun 8, 2026

chDB is an in-process SQL OLAP engine for Python: an embedded build of ClickHouse you pip install and call directly, running full ClickHouse SQL over local files, pandas DataFrames, and remote sources with no server to start and no network hop. You query data where it lives and get a pandas DataFrame back:

1import chdb
2df = chdb.query("SELECT event, count() FROM file('events.parquet') GROUP BY event", "DataFrame")

There's no server to run, no connection to open, and no import step. The query runs inside your Python process and returns results in the format you ask for.

Install and run #

chDB is a single pip package with no server component:

1pip install chdb

The whole API surface starts with one function, chdb.query(sql, output_format). Point it at a file and run ClickHouse SQL over it:

1import chdb
2
3# Read a Parquet file and return a pandas DataFrame
4df = chdb.query("SELECT * FROM file('events.parquet') LIMIT 5", "DataFrame")
5print(df)
1   id     event         day  amount
20   1     click  2026-06-01     142
31   2     login  2026-06-01      88
42   3  purchase  2026-06-01     219
53   4    logout  2026-06-01      57
64   5     click  2026-06-02     134

The second argument is the output format. "DataFrame" gives you pandas; other common choices are "ArrowTable" (a pyarrow.Table, zero-copy), "Pretty" for terminal display, and "JSONEachRow" for NDJSON. The schema is inferred from the file, so there's no CREATE TABLE and no type declarations.

How it works #

chDB is the same engine as ClickHouse server and ClickHouse Cloud, compiled to run inside your Python process rather than as a separate service. Three things follow from that.

It runs in-process, with no server. There is no daemon to start, no port to bind, and no network round-trip. import chdb loads the engine into your interpreter, and chdb.query() runs against it directly. That makes it a natural fit for notebooks, scripts, and data pipelines where standing up a database would be overkill.

It speaks the full ClickHouse SQL dialect. chDB uses the same SQL as the server: the same functions, aggregate combinators, window functions, and table functions. A query you write in a notebook runs unchanged against a ClickHouse server or ClickHouse Cloud later.

Data stays in place. You don't load data into chDB; you reference it with a table function (file() for a local file, s3() for object storage, postgresql() for a database), and the engine reads only what the query needs.

What it can query #

chDB reads local files, in-memory DataFrames, and remote sources, and you can JOIN across them in a single query. Each row below links to its step-by-step Python how-to.

SourceHow to query itHow-to
JSON / NDJSON filefile('data.json', JSONEachRow)Read a JSON file in Python
Parquet filefile('data.parquet')Read a Parquet file in Python
CSV filefile('data.csv')Read a CSV file in Python
Nested / array JSONarrayJoin(...), dot pathsFlatten nested JSON in Python
pandas DataFramePython(df)(below)
Amazon S3s3('https://...')
HTTP / remote URLurl('https://...')
PostgreSQLpostgresql(...)
MySQLmysql(...)
MongoDBmongodb(...)
SQLitesqlite(...)

chDB can also query an existing pandas DataFrame in place, so you can mix SQL and pandas in the same workflow without copying data out:

1import chdb
2import pandas as pd
3
4orders = pd.read_parquet("orders.parquet")
5
6chdb.query("""
7    SELECT country, count() AS orders, round(sum(revenue), 2) AS revenue
8    FROM Python(orders)
9    GROUP BY country
10    ORDER BY revenue DESC
11""", "DataFrame")

Because these are all table functions, a single query can join a local CSV to a Postgres table to a Parquet file on S3 — no staging tables, no manual downloads.

chDB vs pandas and pyarrow: when and why #

For loading and lightly transforming a small file, pandas.read_* is fine. chDB earns its place when the work is analytical (filtering, grouping, joining, aggregating), or when the file is large enough that pandas starts to strain memory.

Take a group-and-aggregate over a JSON file. In pandas you read the whole file into memory, then chain operations:

1import pandas as pd
2df = pd.read_json("events.ndjson", lines=True)
3out = (df[df.event_type == "purchase"]
4       .groupby("country")["revenue"].sum()
5       .sort_values(ascending=False))

In chDB it's one declarative query, and the parsing and aggregation run in compiled ClickHouse code rather than the Python interpreter:

1import chdb
2out = chdb.query("""
3    SELECT country, round(sum(revenue), 2) AS revenue
4    FROM file('events.ndjson', JSONEachRow)
5    WHERE event_type = 'purchase'
6    GROUP BY country
7    ORDER BY revenue DESC
8""", "DataFrame")

On a 2M-row, ~146 MB NDJSON file (Apple M4 Pro, best-of-3 warm), the chDB aggregate runs roughly 10x faster than the equivalent pure-Python loop, and the gap widens with file size. The SQL also reads only the columns the query touches, so memory stays flat where read_json would materialise the entire file.

Drop-in pandas: chDB DataStore #

The SQL approach above means rewriting pandas logic as SQL. As of chDB 4.0 (March 2026), you don't have to. The DataStore API lets you change one import and run your existing pandas code on the ClickHouse engine:

1# import pandas as pd
2import chdb.datastore as pd
3
4orders = pd.read_parquet("orders.parquet")
5out = (orders[orders.amount > 100]
6       .groupby("country")["amount"]
7       .sum()
8       .sort_values(ascending=False))
9print(out)

Everything after the import line is ordinary pandas. DataStore is lazy: rather than materialising a new DataFrame at every step the way pandas does, it records your operations and compiles the whole pipeline into ClickHouse SQL (with filter pushdown, column pruning, and limit propagation), executing only when you actually need a result — a print(), a len(), an iteration. On large data that often makes it much faster than pandas, and it reads and writes DataFrames zero-copy, sharing memory directly with NumPy for numeric and fixed-width columns instead of serialising.

The reason it works as a true drop-in is how it handles coverage. pandas has a vast surface, and not every operation has a ClickHouse equivalent. DataStore's planner splits the pipeline into segments at the boundaries where ClickHouse can't run an operation and routes each segment to the right engine, so an unsupported step falls back to pandas rather than erroring out. You get ClickHouse's speed where it applies and pandas' completeness everywhere else, from the same code. Behaviour is checked operation-by-operation against pandas; the chDB docs list the handful of edge cases (NULL/NaN handling, string sort order) where results can differ.

Beyond a single query #

chdb.query() is stateless: each call is independent and in-memory. When you need state (temporary tables, multiple statements, a small local database), use a Session:

1from chdb import session as chs
2
3sess = chs.Session()
4sess.query("CREATE TABLE t (id UInt64, country String) ENGINE = MergeTree ORDER BY id")
5sess.query("INSERT INTO t SELECT id, country FROM file('events.parquet')")
6top = sess.query("SELECT country, count() FROM t GROUP BY country", "DataFrame")
7sess.close()

chDB also supports parametrized queries ({name:Type} placeholders with a params dict, so you never string-format untrusted input into SQL), streaming results for sets too large to hold in memory, a DB-API 2.0 connection interface, and Python UDFs: the full ClickHouse SQL surface, callable from Python.

Prefer the command line? #

If you work from the shell or in CI rather than a notebook, the same engine ships as a standalone binary called clickhouse-local: the identical ClickHouse SQL, formats, and table functions, invoked as clickhouse local -q "...". See what is clickhouse-local; it's the command-line sibling of chDB.

Where to go next #

Prefer the command line? → what is clickhouse-local

Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...