Execution Engine Configuration
Date: 2024-01-01
DataStore can execute operations using different backends. This guide explains how to configure and optimize engine selection.
Available Engines
| Engine | Description | Best For |
|---|---|---|
| auto | Automatically selects best engine per operation | General use (default) |
| chdb | Forces all operations through ClickHouse SQL | Large datasets, aggregations |
| pandas | Forces all operations through pandas | Compatibility testing, pandas-specific features |
Setting the Engine
Global Configuration
Checking Current Engine
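As an illustration of setting a global engine and reading it back, here is a minimal sketch. The `EngineConfig` class, its `set_execution_engine` method, and the `execution_engine` attribute are hypothetical stand-ins written for this example, not DataStore's actual API; only the three engine names and the `auto` default come from this guide.

```python
from dataclasses import dataclass

# The three engine names listed in the "Available Engines" table.
VALID_ENGINES = ("auto", "chdb", "pandas")


@dataclass
class EngineConfig:
    """Hypothetical stand-in for a global engine setting (not DataStore's API)."""

    execution_engine: str = "auto"  # the guide names auto as the default

    def set_execution_engine(self, name: str) -> None:
        # Reject anything other than the three documented engines.
        if name not in VALID_ENGINES:
            raise ValueError(f"unknown engine {name!r}; expected one of {VALID_ENGINES}")
        self.execution_engine = name


config = EngineConfig()
config.set_execution_engine("chdb")  # force ClickHouse SQL for every operation
current = config.execution_engine    # read the active engine back
```

Validating the name at the point of assignment means a typo such as `"chDB"` fails immediately instead of silently falling back to another engine.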
Auto Mode
In auto mode (default), DataStore selects the optimal engine for each operation:
Operations Executed in chDB
- SQL-compatible filtering (filter(), where())
- Column selection (select())
- Sorting (sort(), orderby())
- Grouping and aggregation (groupby().agg())
- Joins (join(), merge())
- Distinct (distinct(), drop_duplicates())
- Limiting (limit(), head(), tail())
Operations Executed in pandas
- Custom apply functions (apply(custom_func))
- Complex pivot tables with custom aggregations
- Operations not expressible in SQL
- When input is already a pandas DataFrame
Example
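The routing described in the two lists above can be sketched in plain Python. The snippet classifies each step of a pipeline by name and groups consecutive steps that share an engine. The helper names `engine_for` and `plan` are hypothetical, and whether DataStore batches neighbouring operations in this way is an assumption made for illustration.

```python
from itertools import groupby

# Operation names taken from the "Operations Executed in chDB" list.
CHDB_OPS = {
    "filter", "where", "select", "sort", "orderby", "groupby", "agg",
    "join", "merge", "distinct", "drop_duplicates", "limit", "head", "tail",
}


def engine_for(op: str) -> str:
    """Route SQL-expressible operations to chDB and everything else to pandas."""
    return "chdb" if op in CHDB_OPS else "pandas"


def plan(pipeline):
    """Group consecutive operations that share an engine into one stage."""
    return [(engine, list(ops)) for engine, ops in groupby(pipeline, key=engine_for)]


# "apply" stands for a custom Python function, which the guide assigns to pandas.
stages = plan(["filter", "groupby", "agg", "apply", "sort", "head"])
for engine, ops in stages:
    print(f"{engine}: {', '.join(ops)}")
```

In this sketch the custom `apply` step splits the pipeline into three stages: a chDB stage, a pandas stage, and a second chDB stage.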
chDB Mode
Force all operations through ClickHouse SQL:
When to Use
- Processing large datasets (millions of rows)
- Heavy aggregation workloads
- When you want maximum SQL optimization
- Consistent behavior across all operations
Performance Characteristics
| Operation Type | Performance |
|---|---|
| GroupBy/Aggregation | Excellent (up to 20x faster) |
| Complex Filtering | Excellent |
| Sorting | Very Good |
| Simple Single Filters | Good (slight overhead) |
Limitations
- Custom Python functions may not be supported
- Some pandas-specific features require conversion
pandas Mode
Force all operations through pandas:
When to Use
- Compatibility testing with pandas
- Using pandas-specific features
- Debugging pandas-related issues
- When data is already in pandas format
Performance Characteristics
| Operation Type | Performance |
|---|---|
| Simple Single Operations | Good |
| Custom Functions | Excellent |
| Complex Aggregations | Slower than chDB |
| Large Datasets | Memory intensive |
Cross-DataStore Engine
Configure the engine for operations that combine columns from different DataStores:
Example
Engine Selection Logic
Auto Mode Decision Tree
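One way to read the selection rules in this guide as a decision procedure is sketched below. The function `choose_engine`, its parameters, and the order in which the checks run are assumptions for illustration; the guide states the individual rules but not their precedence.

```python
# Operations the guide lists as executed in chDB under auto mode.
SQL_OPS = {
    "filter", "where", "select", "sort", "orderby", "groupby", "agg",
    "join", "merge", "distinct", "drop_duplicates", "limit", "head", "tail",
}


def choose_engine(mode: str, op: str, input_is_dataframe: bool = False) -> str:
    """Pick an engine for one operation (hypothetical helper, not DataStore's API)."""
    # A forced mode wins over every other consideration.
    if mode in ("chdb", "pandas"):
        return mode
    if mode != "auto":
        raise ValueError(f"unknown mode {mode!r}")
    # Auto mode: data that already lives in a pandas DataFrame stays in pandas.
    if input_is_dataframe:
        return "pandas"
    # Auto mode: SQL-expressible operations go to chDB.
    if op in SQL_OPS:
        return "chdb"
    # Everything else (custom apply functions, non-SQL operations) goes to pandas.
    return "pandas"


print(choose_engine("auto", "groupby"))
print(choose_engine("auto", "apply"))
print(choose_engine("auto", "filter", input_is_dataframe=True))
print(choose_engine("pandas", "groupby"))
```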
Function-Level Override
Some functions can have their engine explicitly configured:
See Function Config for details.
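A per-function override can be pictured as a lookup table consulted before the global setting. The sketch below is a self-contained illustration: the `FunctionEngineOverrides` class, its methods, and the function names `upper`, `lower`, and `length` are all hypothetical and are not taken from DataStore's Function Config API.

```python
class FunctionEngineOverrides:
    """Hypothetical registry mapping function names to a forced engine."""

    def __init__(self, default_engine: str = "auto") -> None:
        self.default_engine = default_engine
        self._overrides: dict[str, str] = {}

    def use_engine(self, engine: str, *function_names: str) -> None:
        # Only a concrete engine can be forced; "auto" is the fallback.
        if engine not in ("chdb", "pandas"):
            raise ValueError(f"cannot force engine {engine!r}")
        for name in function_names:
            self._overrides[name] = engine

    def engine_for(self, function_name: str) -> str:
        # An explicit override wins; otherwise fall back to the global default.
        return self._overrides.get(function_name, self.default_engine)


overrides = FunctionEngineOverrides()
overrides.use_engine("pandas", "upper", "lower")  # illustrative function names
overrides.use_engine("chdb", "length")

print(overrides.engine_for("upper"))
print(overrides.engine_for("length"))
print(overrides.engine_for("substring"))
```

Functions without an override fall through to the global setting, so only the exceptions need to be listed.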
Performance Comparison
Benchmark results on 10M rows:
| Operation | pandas (ms) | chdb (ms) | Speedup |
|---|---|---|---|
| GroupBy count | 347 | 17 | 19.93x |
| Combined ops | 1,535 | 234 | 6.56x |
| Complex pipeline | 2,047 | 380 | 5.39x |
| Filter+Sort+Head | 1,537 | 350 | 4.40x |
| GroupBy agg | 406 | 141 | 2.88x |
| Single filter | 276 | 526 | 0.52x |
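The Speedup column appears to be the pandas time divided by the chdb time, so values above 1 favour chDB. The sketch below recomputes that ratio from the rounded millisecond figures in the table. Some recomputed values differ slightly from the published column (for example GroupBy count), presumably because the published figures were derived from unrounded timings.

```python
# (pandas_ms, chdb_ms) pairs copied from the benchmark table.
benchmarks_ms = {
    "GroupBy count": (347, 17),
    "Combined ops": (1535, 234),
    "Complex pipeline": (2047, 380),
    "Filter+Sort+Head": (1537, 350),
    "GroupBy agg": (406, 141),
    "Single filter": (276, 526),
}

# Speedup = pandas time / chdb time; above 1 means chDB was faster.
speedups = {
    op: round(pandas_ms / chdb_ms, 2)
    for op, (pandas_ms, chdb_ms) in benchmarks_ms.items()
}

faster_in_chdb = [op for op, ratio in speedups.items() if ratio > 1]
faster_in_pandas = [op for op, ratio in speedups.items() if ratio < 1]

for op, ratio in speedups.items():
    print(f"{op:<18} {ratio:>6.2f}x")
```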
Key insights:
- chDB excels at aggregations and complex pipelines
- pandas is faster for simple single operations (roughly 1.9x on the single-filter benchmark above)
- Use auto mode to get the best of both