Function-Level Configuration

DataStore allows fine-grained control over execution at the function level, including engine selection and dtype correction.

Function Engine Configuration

Override the execution engine for specific functions.

Setting Function Engines

from chdb.datastore.config import function_config

# Force specific functions to use chdb
function_config.use_chdb('length', 'substring', 'concat')

# Force specific functions to use pandas
function_config.use_pandas('upper', 'lower', 'capitalize')

# Set default preference
function_config.prefer_chdb()    # Default to chdb
function_config.prefer_pandas()  # Default to pandas

# Reset to auto
function_config.reset()

When to Use

Force chdb for:

  • Functions with better ClickHouse performance
  • Functions that benefit from SQL optimization
  • Large-scale string/datetime operations

Force pandas for:

  • Functions with pandas-specific behavior
  • When exact pandas compatibility is required
  • Custom string operations

Example

from chdb import datastore as pd
from chdb.datastore.config import function_config

# Configure function engines
function_config.use_chdb('length', 'substring')
function_config.use_pandas('upper')

ds = pd.read_csv("data.csv")

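# str.len() corresponds to length(), which will use chdb
ds['name_len'] = ds['name'].str.len()

# str.slice() corresponds to substring(), which will use chdb
ds['prefix'] = ds['name'].str.slice(0, 3)

# str.upper() corresponds to upper(), which will use pandas
ds['name_upper'] = ds['name'].str.upper()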

Overlapping Functions

159+ functions are available in both the chdb and pandas engines, including:

Category     Functions
String       length, upper, lower, trim, ltrim, rtrim, concat, substring, replace, reverse, contains, startswith, endswith
Math         abs, round, floor, ceil, exp, log, log10, sqrt, pow, sin, cos, tan
DateTime     year, month, day, hour, minute, second, dayofweek, dayofyear, quarter
Aggregation  sum, avg, min, max, count, std, var, median

For overlapping functions, the engine is selected based on:

  1. Explicit function configuration (if set)
  2. Global execution_engine setting
  3. Auto-selection based on context
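
The precedence above can be sketched in plain Python. This is a minimal illustration, not DataStore's implementation: `resolve_engine`, its parameters, and the `auto_choice` stand-in for the context-based heuristic are all hypothetical names introduced here.

```python
from typing import Dict


def resolve_engine(
    func_name: str,
    function_overrides: Dict[str, str],
    global_engine: str = "auto",
    auto_choice: str = "chdb",
) -> str:
    """Return the engine for an overlapping function (illustrative only)."""
    # 1. An explicit per-function setting wins.
    override = function_overrides.get(func_name)
    if override in ("chdb", "pandas"):
        return override
    # 2. Otherwise fall back to the global execution_engine setting.
    if global_engine in ("chdb", "pandas"):
        return global_engine
    # 3. Otherwise defer to auto-selection. The real heuristic is not
    #    described here, so the caller supplies a stand-in answer.
    return auto_choice


overrides = {"length": "chdb", "upper": "pandas"}
print(resolve_engine("length", overrides, global_engine="pandas"))
print(resolve_engine("lower", overrides, global_engine="pandas"))
print(resolve_engine("lower", overrides))
```

In this sketch, `length` stays on chdb even though the global setting is pandas, because the per-function override is consulted first.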

chdb-Only Functions

Some functions are only available through ClickHouse:

Category     Functions
Array        arraySum, arrayAvg, arraySort, arrayDistinct, groupArray, arrayElement
JSON         JSONExtractString, JSONExtractInt, JSONExtractFloat, JSONHas
URL          domain, path, protocol, extractURLParameter
IP           IPv4StringToNum, IPv4NumToString, isIPv4String
Geo          greatCircleDistance, geoDistance, geoToH3
Hash         cityHash64, xxHash64, sipHash64, MD5, SHA256
Conditional  sumIf, countIf, avgIf, minIf, maxIf

These functions automatically use the chdb engine regardless of configuration.


pandas-Only Functions

Some functions are only available through pandas:

Category       Functions
Apply          Custom lambda functions, user-defined functions
Complex Pivot  Pivot tables with custom aggregations
Stack/Unstack  Complex reshaping operations
Interpolate    Time series interpolation methods

These functions automatically use the pandas engine regardless of configuration.
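
A rough way to picture how single-engine functions bypass configuration is a lookup that runs before any configured choice is consulted. The sets and the `route` function below are hypothetical. The chdb-only names are taken from the table above, while the pandas-only names are illustrative labels for the listed categories rather than confirmed identifiers.

```python
# Hypothetical membership sets; the real lists are much longer.
CHDB_ONLY = {"arraySum", "JSONExtractString", "cityHash64", "sumIf"}
PANDAS_ONLY = {"apply", "interpolate", "stack", "unstack"}


def route(func_name: str, configured_engine: str) -> str:
    """Pick an engine, letting single-engine functions ignore configuration."""
    if func_name in CHDB_ONLY:
        return "chdb"
    if func_name in PANDAS_ONLY:
        return "pandas"
    # Overlapping functions honor whatever was configured.
    return configured_engine


print(route("sumIf", "pandas"))
print(route("interpolate", "chdb"))
print(route("upper", "pandas"))
```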


Dtype Correction

Configure how DataStore corrects data types between engines.

Correction Levels

from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config

# No correction
config.set_correction_level(CorrectionLevel.NONE)

# Critical types only (NULL handling, boolean)
config.set_correction_level(CorrectionLevel.CRITICAL)

# High priority (default) - common type mismatches
config.set_correction_level(CorrectionLevel.HIGH)

# Medium - more aggressive correction
config.set_correction_level(CorrectionLevel.MEDIUM)

# All - correct all possible types
config.set_correction_level(CorrectionLevel.ALL)

Correction Level Details

Level           Description              Types Corrected
NONE            No automatic correction  None
CRITICAL        Essential corrections    NULL handling, boolean conversion
HIGH (default)  Common corrections       Integer/float precision, datetime, string encoding
MEDIUM          More corrections         Decimal precision, timezone handling
ALL             Maximum correction       All type differences
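
One way to read the table is as an ordered scale where each level also covers everything below it. That cumulative reading is an assumption, not something stated above. The sketch below models it with a stand-in enum (`LevelSketch`, with made-up numeric values) rather than the real `CorrectionLevel`, and the category keys are hypothetical labels for the table entries.

```python
from enum import IntEnum


class LevelSketch(IntEnum):
    """Stand-in for CorrectionLevel; numeric values are assumed."""
    NONE = 0
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    ALL = 4


# Lowest level at which each category from the table is corrected.
MIN_LEVEL = {
    "null_handling": LevelSketch.CRITICAL,
    "boolean": LevelSketch.CRITICAL,
    "int_float_precision": LevelSketch.HIGH,
    "datetime": LevelSketch.HIGH,
    "string_encoding": LevelSketch.HIGH,
    "decimal_precision": LevelSketch.MEDIUM,
    "timezone": LevelSketch.MEDIUM,
}


def is_corrected(category: str, level: LevelSketch) -> bool:
    """Return True if `category` would be corrected at `level`."""
    if level == LevelSketch.NONE:
        return False
    # Categories not listed in the table are only covered by ALL.
    return level >= MIN_LEVEL.get(category, LevelSketch.ALL)


print(is_corrected("timezone", LevelSketch.HIGH))
print(is_corrected("timezone", LevelSketch.MEDIUM))
```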

When Types Need Correction

Type differences can occur when:

  1. ClickHouse → pandas: Different integer type representations (Int64 vs int64)
  2. pandas → ClickHouse: Python objects to SQL types
  3. NULL handling: pandas NA vs ClickHouse NULL
  4. Boolean: Different boolean representations
  5. DateTime: Timezone differences

Example

from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config

# Strict mode - expect exact type matches
config.set_correction_level(CorrectionLevel.NONE)

# Relaxed mode - auto-fix type issues
config.set_correction_level(CorrectionLevel.ALL)

Function Configuration API

function_config Object

from chdb.datastore.config import function_config

# Force engine for functions
function_config.use_chdb(*function_names)
function_config.use_pandas(*function_names)

# Set default preference
function_config.prefer_chdb()
function_config.prefer_pandas()

# Reset to default (auto)
function_config.reset()

# Check configuration
function_config.get_engine('length')  # Returns 'chdb', 'pandas', or 'auto'
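
Because these settings persist until `reset()` is called, it can be convenient to scope overrides to a block of code. The context manager below is a sketch built only on the `use_chdb`, `use_pandas`, and `reset` calls listed above. `function_engines` and `RecordingConfig` are hypothetical helpers, and the stand-in object is used so the example runs without chdb installed.

```python
from contextlib import contextmanager


@contextmanager
def function_engines(cfg, chdb=(), pandas=()):
    """Apply per-function engine overrides for the duration of a block."""
    if chdb:
        cfg.use_chdb(*chdb)
    if pandas:
        cfg.use_pandas(*pandas)
    try:
        yield cfg
    finally:
        # reset() is documented as returning to auto; this sketch assumes
        # that also clears overrides set before entering the block.
        cfg.reset()


class RecordingConfig:
    """Stand-in that records calls; not part of chdb."""

    def __init__(self):
        self.calls = []

    def use_chdb(self, *names):
        self.calls.append(("use_chdb", names))

    def use_pandas(self, *names):
        self.calls.append(("use_pandas", names))

    def reset(self):
        self.calls.append(("reset", ()))


cfg = RecordingConfig()
with function_engines(cfg, chdb=("length",), pandas=("upper",)):
    pass  # run the pipeline here
print(cfg.calls)
```

With the real library, the same helper would presumably be called with the `function_config` object imported from `chdb.datastore.config` in place of the stand-in.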

Per-Call Override

Some methods support a per-call engine override:

# Using engine parameter (where supported)
ds['result'] = ds['col'].str.upper(engine='pandas')

Best Practices

1. Start with Defaults

from chdb.datastore.config import config

# Use auto mode, let DataStore decide
config.use_auto()

2. Configure for Specific Workloads

from chdb.datastore.config import function_config

# For ClickHouse-optimized string processing
function_config.use_chdb('length', 'substring', 'concat')

# For pandas-compatible string behavior
function_config.use_pandas('upper', 'lower')

3. Use Appropriate Correction Level

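from chdb.datastore.dtype_correction.config import CorrectionLevel
from chdb.datastore.config import config

# Development: more permissive
config.set_correction_level(CorrectionLevel.ALL)

# Production: stricter
config.set_correction_level(CorrectionLevel.HIGH)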

4. Test Both Engines

from chdb.datastore.config import config

# process_data() stands in for your own pipeline function

# Test with chdb
config.use_chdb()
result_chdb = process_data()

# Test with pandas
config.use_pandas()
result_pandas = process_data()

# Compare results
assert result_chdb.equals(result_pandas)
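
The comparison above leaves the global engine set to pandas once it finishes. A small wrapper can run both passes and restore auto mode afterwards. This is a sketch that relies only on the `use_chdb`, `use_pandas`, and `use_auto` calls shown in this section. `run_on_both_engines` and `StubConfig` are hypothetical, with the stub standing in for the real `config` object so the example runs on its own.

```python
def run_on_both_engines(cfg, process):
    """Run process() once per engine and return (chdb_result, pandas_result)."""
    try:
        cfg.use_chdb()
        chdb_result = process()
        cfg.use_pandas()
        pandas_result = process()
    finally:
        # Restore auto mode even if one of the runs raises.
        cfg.use_auto()
    return chdb_result, pandas_result


class StubConfig:
    """Stand-in for the real config object; tracks the active engine."""

    def __init__(self):
        self.engine = "auto"

    def use_chdb(self):
        self.engine = "chdb"

    def use_pandas(self):
        self.engine = "pandas"

    def use_auto(self):
        self.engine = "auto"


stub = StubConfig()
# The lambda reports which engine was active when it ran.
chdb_result, pandas_result = run_on_both_engines(stub, lambda: stub.engine)
print(chdb_result, pandas_result, stub.engine)
```

If the results are pandas objects, note that `.equals` compares dtypes as well as values, so a dtype difference between engines can fail the assertion even when the values match. The configured correction level may therefore affect the outcome.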