Function-Level Configuration

DataStore allows fine-grained control over execution at the function level, including engine selection and dtype correction.

Function Engine Configuration

Override the execution engine for specific functions.

Setting Function Engines

from chdb.datastore.config import function_config

# Force specific functions to use chdb
function_config.use_chdb('length', 'substring', 'concat')

# Force specific functions to use pandas
function_config.use_pandas('upper', 'lower', 'capitalize')

# Set default preference (call one of these; the last call wins)
function_config.prefer_chdb()    # Default to chdb
function_config.prefer_pandas()  # Default to pandas

# Reset to auto
function_config.reset()

When to Use

Force chdb for:

Functions with better ClickHouse performance
Functions that benefit from SQL optimization
Large-scale string/datetime operations

Force pandas for:

Functions with pandas-specific behavior
When exact pandas compatibility is required
Custom string operations

Example

from chdb import datastore as pd
from chdb.datastore.config import function_config

# Configure function engines
function_config.use_chdb('length', 'substring')
function_config.use_pandas('upper')

ds = pd.read_csv("data.csv")

# length() will use chdb
ds['name_len'] = ds['name'].str.len()

# substring() will use chdb
ds['prefix'] = ds['name'].str.slice(0, 3)

# upper() will use pandas
ds['name_upper'] = ds['name'].str.upper()

Overlapping Functions

159+ functions are available in both chdb and pandas engines:

Category	Functions
String	`length`, `upper`, `lower`, `trim`, `ltrim`, `rtrim`, `concat`, `substring`, `replace`, `reverse`, `contains`, `startswith`, `endswith`
Math	`abs`, `round`, `floor`, `ceil`, `exp`, `log`, `log10`, `sqrt`, `pow`, `sin`, `cos`, `tan`
DateTime	`year`, `month`, `day`, `hour`, `minute`, `second`, `dayofweek`, `dayofyear`, `quarter`
Aggregation	`sum`, `avg`, `min`, `max`, `count`, `std`, `var`, `median`

For overlapping functions, the engine is selected based on the following, in order:

1. Explicit function configuration (if set)
2. Global execution_engine setting
3. Auto-selection based on context
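
The selection order above can be sketched in plain Python. This is a minimal model, not DataStore's implementation: `resolve_engine`, `function_overrides`, and `global_engine` are hypothetical names, and treating the three criteria as a strict priority order is an assumption.

```python
from typing import Dict


def resolve_engine(
    function_name: str,
    function_overrides: Dict[str, str],
    global_engine: str = "auto",
) -> str:
    """Return 'chdb', 'pandas', or 'auto' for one function (illustrative model)."""
    # 1. An explicit per-function setting wins if present.
    override = function_overrides.get(function_name)
    if override is not None:
        return override
    # 2. Otherwise fall back to the global execution_engine setting.
    if global_engine in ("chdb", "pandas"):
        return global_engine
    # 3. Nothing is pinned: leave the choice to auto-selection.
    return "auto"


overrides = {"length": "chdb", "upper": "pandas"}
print(resolve_engine("length", overrides, global_engine="pandas"))  # chdb
print(resolve_engine("round", overrides, global_engine="pandas"))   # pandas
print(resolve_engine("round", overrides))                           # auto
```

In this model a per-function override always beats the global setting, which matches the intent of pinning a few functions while leaving everything else on the default.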

chdb-Only Functions

Some functions are only available through ClickHouse:

Category	Functions
Array	`arraySum`, `arrayAvg`, `arraySort`, `arrayDistinct`, `groupArray`, `arrayElement`
JSON	`JSONExtractString`, `JSONExtractInt`, `JSONExtractFloat`, `JSONHas`
URL	`domain`, `path`, `protocol`, `extractURLParameter`
IP	`IPv4StringToNum`, `IPv4NumToString`, `isIPv4String`
Geo	`greatCircleDistance`, `geoDistance`, `geoToH3`
Hash	`cityHash64`, `xxHash64`, `sipHash64`, `MD5`, `SHA256`
Conditional	`sumIf`, `countIf`, `avgIf`, `minIf`, `maxIf`

These functions automatically use the chdb engine regardless of configuration.

pandas-Only Functions

Some functions are only available through pandas:

Category	Functions
Apply	Custom lambda functions, user-defined functions
Complex Pivot	Pivot tables with custom aggregations
Stack/Unstack	Complex reshaping operations
Interpolate	Time series interpolation methods

These functions automatically use the pandas engine regardless of configuration.
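
Engine-exclusive functions bypass configuration in both directions. A rough sketch of that rule follows; `CHDB_ONLY`, `PANDAS_ONLY`, and `effective_engine` are hypothetical names, the chdb set holds only a few entries from the table above, and the pandas identifiers are assumed spellings for the Apply and Interpolate categories.

```python
# Small sample taken from the chdb-only table; the real list is longer.
CHDB_ONLY = {"arraySum", "JSONExtractString", "cityHash64", "sumIf"}
# Assumed identifiers for the pandas-only categories (Apply, Interpolate).
PANDAS_ONLY = {"apply", "interpolate"}


def effective_engine(function_name: str, configured_engine: str) -> str:
    """Return the engine that would actually run, given the configured one."""
    if function_name in CHDB_ONLY:
        return "chdb"      # only ClickHouse implements it
    if function_name in PANDAS_ONLY:
        return "pandas"    # only pandas implements it
    return configured_engine  # overlapping function: configuration applies


print(effective_engine("cityHash64", "pandas"))  # chdb
print(effective_engine("apply", "chdb"))         # pandas
print(effective_engine("upper", "pandas"))       # pandas
```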

Dtype Correction

Configure how DataStore corrects data types between engines.

Correction Levels

from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config

# No correction
config.set_correction_level(CorrectionLevel.NONE)

# Critical types only (NULL handling, boolean)
config.set_correction_level(CorrectionLevel.CRITICAL)

# High priority (default) - common type mismatches
config.set_correction_level(CorrectionLevel.HIGH)

# Medium - more aggressive correction
config.set_correction_level(CorrectionLevel.MEDIUM)

# All - correct all possible types
config.set_correction_level(CorrectionLevel.ALL)

Correction Level Details

Level	Description	Types Corrected
`NONE`	No automatic correction	None
`CRITICAL`	Essential corrections	NULL handling, boolean conversion
`HIGH` (default)	Common corrections	Integer/float precision, datetime, string encoding
`MEDIUM`	More corrections	Decimal precision, timezone handling
`ALL`	Maximum correction	All type differences
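
One way to read the table is as a ladder, where each level also applies the corrections of the levels before it. The sketch below models that reading with a stand-in enum. The cumulative behaviour, the numeric values, and the names `Level`, `CATEGORY_LEVEL`, and `is_corrected` are assumptions for illustration, not the actual `CorrectionLevel` definition.

```python
from enum import IntEnum


class Level(IntEnum):
    """Stand-in for CorrectionLevel; the numeric ordering is assumed."""
    NONE = 0
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    ALL = 4


# Lowest level at which each category from the table is corrected.
CATEGORY_LEVEL = {
    "null_handling": Level.CRITICAL,
    "boolean": Level.CRITICAL,
    "int_float_precision": Level.HIGH,
    "datetime": Level.HIGH,
    "string_encoding": Level.HIGH,
    "decimal_precision": Level.MEDIUM,
    "timezone": Level.MEDIUM,
}


def is_corrected(category: str, level: Level) -> bool:
    """True if `category` is corrected at `level` under the cumulative reading."""
    # Categories not listed in the table are only covered by ALL.
    required = CATEGORY_LEVEL.get(category, Level.ALL)
    return level >= required


print(is_corrected("boolean", Level.HIGH))   # True
print(is_corrected("timezone", Level.HIGH))  # False
print(is_corrected("timezone", Level.ALL))   # True
```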

When Types Need Correction

Type differences can occur when:

ClickHouse → pandas: Different integer type representations (Int64 vs int64)
pandas → ClickHouse: Python objects to SQL types
NULL handling: pandas NA vs ClickHouse NULL
Boolean: Different boolean representations
DateTime: Timezone differences

Example

from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config

# Strict mode - expect exact type matches
config.set_correction_level(CorrectionLevel.NONE)

# Relaxed mode - auto-fix type issues
config.set_correction_level(CorrectionLevel.ALL)

Function Configuration API

function_config Object

from chdb.datastore.config import function_config

# Force engine for functions
function_config.use_chdb(*function_names)
function_config.use_pandas(*function_names)

# Set default preference
function_config.prefer_chdb()
function_config.prefer_pandas()

# Reset to default (auto)
function_config.reset()

# Check configuration
function_config.get_engine('length')  # Returns 'chdb', 'pandas', or 'auto'
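
To make the interplay of these calls concrete, here is a toy model of the same method names in plain Python. It is not the real `function_config`: the class `FunctionConfigModel` is hypothetical, and two behaviours are assumed rather than documented, namely that `prefer_*` changes what `get_engine` returns for unconfigured functions and that `reset` clears both the preference and the per-function overrides.

```python
class FunctionConfigModel:
    """Toy model of the calls listed above; not the real function_config."""

    def __init__(self) -> None:
        self._default = "auto"
        self._overrides = {}

    def use_chdb(self, *function_names: str) -> None:
        for name in function_names:
            self._overrides[name] = "chdb"

    def use_pandas(self, *function_names: str) -> None:
        for name in function_names:
            self._overrides[name] = "pandas"

    def prefer_chdb(self) -> None:
        self._default = "chdb"

    def prefer_pandas(self) -> None:
        self._default = "pandas"

    def reset(self) -> None:
        # Assumed: reset drops overrides and the default preference together.
        self._default = "auto"
        self._overrides.clear()

    def get_engine(self, function_name: str) -> str:
        return self._overrides.get(function_name, self._default)


model = FunctionConfigModel()
model.use_chdb("length", "substring")
model.prefer_pandas()
print(model.get_engine("length"))  # chdb
print(model.get_engine("round"))   # pandas
model.reset()
print(model.get_engine("length"))  # auto
```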

Per-Call Override

Some methods support per-call engine override:

# Using engine parameter (where supported)
ds['result'] = ds['col'].str.upper(engine='pandas')

Best Practices

1. Start with Defaults

from chdb.datastore.config import config

# Use auto mode, let DataStore decide
config.use_auto()

2. Configure for Specific Workloads

from chdb.datastore.config import function_config

# For ClickHouse-optimized string processing
function_config.use_chdb('length', 'substring', 'concat')

# For pandas-compatible string behavior
function_config.use_pandas('upper', 'lower')

3. Use Appropriate Correction Level

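from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config

# Development: more permissive
config.set_correction_level(CorrectionLevel.ALL)

# Production: stricter
config.set_correction_level(CorrectionLevel.HIGH)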

4. Test Both Engines

from chdb.datastore.config import config

# process_data() is a placeholder for your own pipeline

# Test with chdb
config.use_chdb()
result_chdb = process_data()

# Test with pandas
config.use_pandas()
result_pandas = process_data()

# Compare results
assert result_chdb.equals(result_pandas)
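
Exact equality can be too strict if the two engines differ in the last bits of a float. A minimal, engine-independent sketch of a more tolerant check on plain Python rows follows. `rows_match` is a hypothetical helper, and converting each result to a list of tuples (sorted by a key column) is left to the caller.

```python
import math
from typing import Sequence, Tuple

Row = Tuple[object, ...]


def rows_match(left: Sequence[Row], right: Sequence[Row], rel_tol: float = 1e-9) -> bool:
    """Compare two row lists cell by cell, allowing tiny float differences."""
    if len(left) != len(right):
        return False
    for row_l, row_r in zip(left, right):
        if len(row_l) != len(row_r):
            return False
        for x, y in zip(row_l, row_r):
            if isinstance(x, float) and isinstance(y, float):
                # Treat two NaN cells as equal; otherwise compare with tolerance.
                if math.isnan(x) and math.isnan(y):
                    continue
                if not math.isclose(x, y, rel_tol=rel_tol):
                    return False
            elif x != y:
                return False
    return True


rows_a = [("a", 0.1 + 0.2), ("b", 2.0)]
rows_b = [("a", 0.3), ("b", 2.0)]
print(rows_a == rows_b)            # False
print(rows_match(rows_a, rows_b))  # True
```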
