The Journey to Zero-Copy: How chDB Became the Fastest SQL Engine on Pandas DataFrame

Xiaozhe Yu, Auxten Wang
Jan 9, 2026 - 11 minute read

The Problem: A Great Engine Trapped Behind Serialization #

If you've worked with data in Python, you know the drill: Pandas is everywhere. It's the lingua franca of data science, the format that bridges your data loading, cleaning, analysis, and visualization. But when datasets grow beyond a few million rows, Pandas starts to struggle. Single-threaded execution, memory-hungry operations, and clunky syntax for complex aggregations become real pain points.

Meanwhile, ClickHouse sits there—the fastest open-source OLAP engine on the planet—waiting to help. But traditionally, using ClickHouse meant setting up servers, loading data into tables, and managing a whole database infrastructure. Not exactly what you want when you just need to run a quick aggregation on a DataFrame.

That's why we built chDB: ClickHouse packaged as a simple Python library. pip install chdb and you're done.

But here's the thing—our first version had a dirty secret.


Every query went through four serialization/deserialization steps. The result? Almost every query took over 100ms, regardless of complexity. We had the world's fastest engine, but it was choking on data conversion overhead.

The Vision: What Users Actually Needed #

From the user's perspective, what they wanted was simple:

"I have a DataFrame. I want to run SQL on it. Give me back a DataFrame. Fast."

No temp files. No format conversions. No memory explosions. Just seamless, native integration.

More specifically:

  • DataFrame In, DataFrame Out — Query any Pandas DataFrame with SQL, get results as a DataFrame
  • Zero Configuration — No registration, no schema definitions, just reference your variable name in SQL
  • ClickHouse Performance — Multi-threaded execution, vectorized processing, all the good stuff
  • Memory Efficient — Handle datasets larger than RAM through streaming

This is what we set out to build with chDB v2.

What We Built: True Native Integration #

After months of work, here's what using chDB looks like today:

import pandas as pd
import chdb

# Your data, wherever it comes from
the_df = pd.DataFrame({
    'user_id': range(1_000_000),
    'category': (['A', 'B', 'C'] * 333_334)[:1_000_000],
    'value': [i * 0.1 for i in range(1_000_000)]
})

# SQL on DataFrame — just reference the variable name!
result = chdb.query("""
    SELECT category,
           COUNT(*) as cnt,
           AVG(value) as avg_val
    FROM Python(the_df)
    GROUP BY category
    ORDER BY cnt DESC
""", "DataFrame")

print(result)

That's it. No registration. No temp files. The DataFrame variable the_df is automatically discovered and made available as a table. The result comes back as a DataFrame, ready for your next Pandas operation or visualization.


This was a huge leap—87x faster than v1.0! But we weren't done yet.

Under the Hood: The Technical Journey #

Challenge 1: Automatic DataFrame Discovery #

When you write Python(df) in your SQL, chDB needs to find that variable in your Python environment. We built an automatic discovery mechanism that:

  1. Searches for the variable name in local and global scope
  2. Validates it's actually a Pandas DataFrame
  3. Parses column names and data types
  4. Wraps it as a ClickHouse table function

No explicit registration required—just name your DataFrame and use it.
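
Conceptually, the lookup resembles the following sketch (illustrative only; it uses Python's inspect module, whereas chDB's real implementation works against the CPython C API in C++):

import inspect
import pandas as pd

def find_dataframe(name: str) -> pd.DataFrame:
    # Walk up the call stack and return the first DataFrame bound to `name`
    # in a caller's local or global scope (simplified discovery sketch).
    for frame_info in inspect.stack():
        frame = frame_info.frame
        candidate = frame.f_locals.get(name, frame.f_globals.get(name))
        if isinstance(candidate, pd.DataFrame):
            return candidate
    raise LookupError(f"No DataFrame named {name!r} found in caller scopes")

sales = pd.DataFrame({"region": ["EU", "US"], "revenue": [1.2, 3.4]})
found = find_dataframe("sales")
print(found.dtypes)  # the column names and dtypes then become the table schema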


Challenge 2: The GIL Problem #

Here's where things got interesting. ClickHouse is a massively parallel engine—it wants to use all your CPU cores. Python's Global Interpreter Lock (GIL) says "one thread at a time, please."

If we called Python's C API for every data access, our multi-threaded engine would degrade to single-threaded execution. The solution? Minimize CPython API calls and batch all Python interactions before the parallel pipeline starts.
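
As a rough analogy in pure Python (not chDB's actual code), the idea is to touch Python objects exactly once up front, then let parallel workers operate only on raw buffers that don't need the GIL:

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

df = pd.DataFrame({"value": np.random.rand(5_000_000)})

# Step 1 (GIL held): snapshot the column into a plain NumPy buffer, once.
values = df["value"].to_numpy()

# Step 2 (parallel): workers only see raw buffers; NumPy releases the GIL
# inside its C loops, so the threads can actually run concurrently.
def partial_sum(chunk: np.ndarray) -> float:
    return float(chunk.sum())

chunks = np.array_split(values, 8)
with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)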


Challenge 3: Python String Encoding Nightmare #

Python's str type is... complicated. Internally, CPython stores a string with 1, 2, or 4 bytes per character depending on its widest code point (PEP 393), while ClickHouse expects UTF-8. Converting through Python's API means acquiring the GIL for every single string.
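
You can see this variability with a few lines of standard Python (just an illustration of the problem, not chDB code):

import sys

# CPython's flexible string representation picks 1, 2, or 4 bytes per
# character based on the widest code point in the string.
ascii_s = "a" * 100    # 1 byte per character
bmp_s   = "中" * 100   # 2 bytes per character
emoji_s = "🚀" * 100   # 4 bytes per character

for s in (ascii_s, bmp_s, emoji_s):
    print(len(s), sys.getsizeof(s))  # same length, very different memory layouts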

So we did something a bit crazy: we rewrote Python's string encoding logic in C++. This let us handle string conversion in parallel, without touching the GIL.

The result? Our test query Q23 (SELECT * FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10) went from 8.6 seconds to 0.56 seconds—a 15x improvement from this single optimization.

┌─────────────────────────────────────────────────────────────────┐
│              String Encoding Performance Impact                 │
│                                                                 │
│   Q23 Query Time (seconds)                                      │
│                                                                 │
│   Python API Encoding:  ████████████████████████████████ 8.6s   │
│   C++ Native Encoding:  ██ 0.56s                                │
│                                                                 │
│   15x faster! 🚀                                                │
└─────────────────────────────────────────────────────────────────┘

Semi-Structured Data: Beyond Simple Columns #

Real-world DataFrames often contain messy object columns with nested dictionaries. chDB handles these automatically:

import pandas as pd
import chdb

# DataFrame with nested JSON-like data
data = pd.DataFrame({
    'event': [
        {'type': 'click', 'metadata': {'x': 100, 'y': 200}},
        {'type': 'scroll', 'metadata': {'x': 150, 'y': 300}},
        {'type': 'click', 'metadata': {'x': 200, 'y': 400}}
    ]
})

# Query nested fields directly with ClickHouse JSON syntax
result = chdb.query("""
    SELECT
        event.type,
        event.metadata.x as x_coord,
        event.metadata.y as y_coord
    FROM Python(data)
    WHERE event.type = 'click'
""", "DataFrame")

chDB samples object columns, detects dictionary structures, and automatically maps them to ClickHouse's native JSON type—giving you full access to ClickHouse's powerful JSON functions.

Performance: The Numbers #

We benchmarked chDB against native Pandas operations using the ClickBench hits dataset loaded as an in-memory DataFrame (1M rows, ~117MB in Parquet).

Simple Aggregation: COUNT(*) #

chDB SQL Statement:

SELECT COUNT(*) FROM Python(df);

Corresponding Pandas Operation:

df.count()

By fully utilizing multi-core CPUs through ClickHouse's multi-threaded execution engine, chDB runs this COUNT(*) aggregation far faster than Pandas: nearly 247x faster.

| Engine | Time    | Speedup |
|--------|---------|---------|
| Pandas | 0.8643s | 1x      |
| chDB   | 0.0035s | 247x    |

Complex Query: GROUP BY + Multiple Aggregations #

chDB SQL Statement:

SELECT
    RegionID,
    SUM(AdvEngineID),
    COUNT(*) AS c,
    AVG(ResolutionWidth),
    COUNT(DISTINCT UserID)
FROM Python(df)
GROUP BY RegionID
ORDER BY c DESC
LIMIT 10

Corresponding Pandas Operation:

(
    df.groupby("RegionID")
      .agg(
          AdvEngineID=("AdvEngineID", "sum"),
          ResolutionWidth=("ResolutionWidth", "mean"),
          UserID=("UserID", "nunique"),
          c=("RegionID", "size")
      )
      .sort_values("c", ascending=False)
      .head(10)
)

Thanks to ClickHouse's highly optimized GROUP BY implementation, chDB still outperforms native Pandas even in this more complex "grouping + sorting + multi-aggregation" scenario.

| Engine | Time    | Speedup |
|--------|---------|---------|
| Pandas | 0.0623s | 1x      |
| chDB   | 0.0219s | 2.8x    |

Streaming: Breaking Memory Limits #

What if your query result is too large to fit in memory? chDB supports streaming results:

import chdb

# Initialize chDB connection
conn = chdb.connect()

# Construct query (generate 500,000 rows of data)
query = "SELECT number FROM numbers(500000)"

# Stream query: retrieve results by block and process
with conn.send_query(query, "DataFrame") as stream_result:
    for chunk_df in stream_result:
        # Custom business logic
        print(chunk_df)

# Close connection
conn.close()

No more OOM errors. Process terabytes of data on your laptop.

v4.0: Closing the Zero-Copy Loop #

With v2.0, we achieved zero-copy on the input side—reading DataFrames directly without serialization. But the output path still went through Parquet serialization. In chDB v4.0, we completed the circle.


How Output Zero-Copy Works #

When ClickHouse produces query results, instead of serializing to Parquet and deserializing back, we now:

  1. Direct Type Mapping: Map ClickHouse column types directly to NumPy dtypes
  2. Memory Sharing: Share underlying memory buffers where possible
  3. Batch Conversion: Convert result chunks directly to NumPy arrays using SIMD-optimized routines

Here's the complete type mapping we implemented:

| ClickHouse Type | Python/NumPy Type        |
|-----------------|--------------------------|
| Int8/16/32/64   | numpy.int8/16/32/64      |
| UInt8/16/32/64  | numpy.uint8/16/32/64     |
| Float32/64      | numpy.float32/64         |
| String          | Python str               |
| Date/DateTime   | numpy.datetime64         |
| Array           | numpy.ndarray            |
| Map             | Python dict              |
| JSON            | Python dict (recursive)  |
| Nullable(T)     | numpy.ma.MaskedArray     |
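
A quick way to see the mapping in practice is to inspect the dtypes of a small result (an illustrative query; the exact dtypes printed may vary by chDB version):

import chdb

# Scalar ClickHouse types should surface as the NumPy dtypes listed above.
result = chdb.query("""
    SELECT
        toInt32(1)           AS i32,
        toUInt64(2)          AS u64,
        toFloat64(2.5)       AS f64,
        'hello'              AS s,
        toDate('2026-01-09') AS d
""", "DataFrame")

print(result.dtypes)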

Benchmark: v4.0 Output Performance #

We benchmarked chDB’s performance in exporting query results to Pandas DataFrames against DuckDB, a similar embedded analytics engine, using the ClickBench hits dataset.

Test Environment

  • Dataset: ClickBench hits dataset (1 million rows, Parquet format, file size ~117MB)
  • Hardware Environment: AWS EC2 c6a.4xlarge instance
  • Test Method: Execute a single query 3 times and take the best result
  • Comparison Scenario: chDB exporting to DataFrame vs. DuckDB exporting to Pandas DataFrame

Test Code:

# chDB: Query Parquet file and export to Pandas DataFrame
import chdb
chdb.query("SELECT * FROM file('hits_0.parquet')", "DataFrame")

# DuckDB: Query Parquet file and export to Pandas DataFrame
import duckdb
duckdb.query("SELECT * FROM read_parquet('hits_0.parquet')").df()

Export Time

  • chDB: 2.6418 seconds
  • DuckDB: 3.4744 seconds

Test results show that when exporting 1 million rows to a Pandas DataFrame, chDB takes about 24% less time than DuckDB, demonstrating superior data conversion performance.

Benchmark for daily Pandas usage #

To provide a relatively complete performance comparison of libraries that can read and write Pandas DataFrames directly, we selected 14 common operations and benchmarked chDB, Pandas, and DuckDB across three different data sizes. The test hardware was a MacBook Pro (M4 Pro, 48 GB RAM). Several facts stand out:

  1. For slicing operations like Head/Limit, Pandas itself has always been the fastest
  2. As data volume increases, the advantages of chDB and DuckDB begin to emerge, with chDB maintaining a lead in most 1M and 10M row challenges
  3. In most cases, chDB and DuckDB performance is very close, and thanks to the v4.0.0 improvements chDB keeps a slight edge, winning roughly 7 out of every 10 comparisons against Pandas and DuckDB at the 10M-row data volume

The benchmark and chart code are available at github.com/auxten/chdb-ds.

Where We Started, Where We Are #


What You Get as a User #

Let's summarize what chDB + Pandas integration means for you:

| Capability       | What It Means                                                |
|------------------|--------------------------------------------------------------|
| DataFrame In     | Query any Pandas DataFrame with SQL, no registration needed  |
| DataFrame Out    | Results come back as DataFrames, ready for visualization     |
| ClickHouse Speed | Multi-threaded, vectorized execution on your data            |
| Zero Config      | Just pip install chdb and you're ready                       |
| JSON Support     | Query nested dictionaries naturally                          |
| Streaming        | Process datasets larger than your RAM                        |
| 80+ Formats      | Also query CSV, Parquet, S3, MySQL, PostgreSQL...            |

Getting Started #

chDB v4 is now in beta!

We welcome everyone to test chDB v4 and share feedback: pip install "chdb>=4.0.0b2"

# pip install chdb

import pandas as pd
import chdb

# Load your data however you want
df = pd.read_csv("your_data.csv")

# Query with SQL
result = chdb.query("""
    SELECT column_a, SUM(column_b)
    FROM Python(df)
    GROUP BY column_a
""", "DataFrame")

# Use the result with any Pandas-compatible tool
result.plot(kind='bar')

Join the Community #

We're always looking to make chDB better. Here's how to connect:

The journey from 30 seconds to 0.5 seconds was long, but we're just getting started. What will you build with chDB?
