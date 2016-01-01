Function-Level Configuration

DataStore allows fine-grained control over execution at the function level, including engine selection and dtype correction.

Override the execution engine for specific functions.

    from chdb.datastore.config import function_config

    # Force specific functions to use chdb
    function_config.use_chdb('length', 'substring', 'concat')

    # Force specific functions to use pandas
    function_config.use_pandas('upper', 'lower', 'capitalize')

    # Set default preference
    function_config.prefer_chdb()    # Default to chdb
    function_config.prefer_pandas()  # Default to pandas

    # Reset to auto
    function_config.reset()

Force chdb for:

Functions with better ClickHouse performance

Functions that benefit from SQL optimization

Large-scale string/datetime operations

Force pandas for:

Functions with pandas-specific behavior

When exact pandas compatibility is required

Custom string operations

    from chdb import datastore as pd
    from chdb.datastore.config import function_config

    # Configure function engines
    function_config.use_chdb('length', 'substring')
    function_config.use_pandas('upper')

    ds = pd.read_csv("data.csv")

    # length() will use chdb
    ds['name_len'] = ds['name'].str.len()

    # substring() will use chdb
    ds['prefix'] = ds['name'].str.slice(0, 3)

    # upper() will use pandas
    ds['name_upper'] = ds['name'].str.upper()

159+ functions are available in both chdb and pandas engines:

    Category      Functions
    String        length, upper, lower, trim, ltrim, rtrim, concat, substring, replace, reverse, contains, startswith, endswith
    Math          abs, round, floor, ceil, exp, log, log10, sqrt, pow, sin, cos, tan
    DateTime      year, month, day, hour, minute, second, dayofweek, dayofyear, quarter
    Aggregation   sum, avg, min, max, count, std, var, median

For overlapping functions, the engine is selected based on:

1. Explicit function configuration (if set)
2. Global execution_engine setting
3. Auto-selection based on context
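The selection order above can be modeled as a short lookup chain. The sketch below is a plain-Python illustration only: `resolve_engine`, its parameters, and the `context_choice` fallback are hypothetical names introduced here, not part of the DataStore API, and it assumes the three criteria are checked in the order listed.

```python
from typing import Dict


def resolve_engine(
    func_name: str,
    function_overrides: Dict[str, str],
    global_engine: str = "auto",
    context_choice: str = "pandas",
) -> str:
    """Hypothetical model of the lookup order; not DataStore's implementation."""
    # 1. An explicit per-function setting wins if present.
    explicit = function_overrides.get(func_name)
    if explicit in ("chdb", "pandas"):
        return explicit
    # 2. Otherwise a non-auto global engine setting applies.
    if global_engine in ("chdb", "pandas"):
        return global_engine
    # 3. Otherwise fall back to whatever auto-selection picked for the context.
    return context_choice


overrides = {"length": "chdb", "upper": "pandas"}
print(resolve_engine("length", overrides, global_engine="pandas"))  # chdb
print(resolve_engine("round", overrides, global_engine="pandas"))   # pandas
print(resolve_engine("round", overrides, context_choice="chdb"))    # chdb
```

In this model an explicit override for `length` beats the global setting, while `round`, which has no override, follows the global setting or, failing that, the context-based choice.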

Some functions are only available through ClickHouse:

    Category      Functions
    Array         arraySum, arrayAvg, arraySort, arrayDistinct, groupArray, arrayElement
    JSON          JSONExtractString, JSONExtractInt, JSONExtractFloat, JSONHas
    URL           domain, path, protocol, extractURLParameter
    IP            IPv4StringToNum, IPv4NumToString, isIPv4String
    Geo           greatCircleDistance, geoDistance, geoToH3
    Hash          cityHash64, xxHash64, sipHash64, MD5, SHA256
    Conditional   sumIf, countIf, avgIf, minIf, maxIf

These functions always run on the chdb engine, regardless of configuration.

Some functions are only available through pandas:

    Category        Functions
    Apply           Custom lambda functions, user-defined functions
    Complex Pivot   Pivot tables with custom aggregations
    Stack/Unstack   Complex reshaping operations
    Interpolate     Time series interpolation methods

These functions always run on the pandas engine, regardless of configuration.
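One way to picture how engine-exclusive functions bypass configuration is a membership check that runs before any configured preference is consulted. The following is a minimal sketch under that assumption; `CHDB_ONLY`, `PANDAS_ONLY`, and `effective_engine` are hypothetical names, the sets hold only a few sample entries, and the pandas-side keys are illustrative labels for the categories listed above.

```python
# Sample entries only; hypothetical registries, not DataStore internals.
CHDB_ONLY = {"arraySum", "JSONExtractString", "cityHash64", "sumIf"}
PANDAS_ONLY = {"apply", "interpolate", "stack", "unstack"}


def effective_engine(func_name: str, configured: str) -> str:
    """Return the engine that would run func_name in this simplified model."""
    # Engine-exclusive functions ignore the configured preference entirely.
    if func_name in CHDB_ONLY:
        return "chdb"
    if func_name in PANDAS_ONLY:
        return "pandas"
    # Functions available in both engines follow the configuration.
    return configured


print(effective_engine("cityHash64", configured="pandas"))  # chdb
print(effective_engine("apply", configured="chdb"))         # pandas
print(effective_engine("length", configured="chdb"))        # chdb
```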

Configure how DataStore corrects data types between engines.

    from chdb.datastore.dtype_correction.config import CorrectionLevel
    from chdb.datastore.config import config

    # No correction
    config.set_correction_level(CorrectionLevel.NONE)

    # Critical types only (NULL handling, boolean)
    config.set_correction_level(CorrectionLevel.CRITICAL)

    # High priority (default) - common type mismatches
    config.set_correction_level(CorrectionLevel.HIGH)

    # Medium - more aggressive correction
    config.set_correction_level(CorrectionLevel.MEDIUM)

    # All - correct all possible types
    config.set_correction_level(CorrectionLevel.ALL)

    Level            Description               Types Corrected
    NONE             No automatic correction   None
    CRITICAL         Essential corrections     NULL handling, boolean conversion
    HIGH (default)   Common corrections        Integer/float precision, datetime, string encoding
    MEDIUM           More corrections          Decimal precision, timezone handling
    ALL              Maximum correction        All type differences
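The level names describe a priority threshold: HIGH corrects only high-priority mismatches, while MEDIUM extends correction to lower-priority ones. The sketch below models that reading with an ordered enum. It assumes the levels are cumulative (each level includes the corrections of the levels listed before it), which the table suggests but does not state. `Level`, `MIN_LEVEL`, the numeric values, and the category keys are hypothetical stand-ins, not the real `CorrectionLevel` definition.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical stand-in for CorrectionLevel; numeric values are assumed."""
    NONE = 0
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    ALL = 4


# Lowest level at which each category is corrected, following the table rows.
MIN_LEVEL = {
    "null_handling": Level.CRITICAL,
    "boolean": Level.CRITICAL,
    "int_float_precision": Level.HIGH,
    "datetime": Level.HIGH,
    "string_encoding": Level.HIGH,
    "decimal_precision": Level.MEDIUM,
    "timezone": Level.MEDIUM,
}


def is_corrected(kind: str, level: Level) -> bool:
    """Categories not listed above are corrected only at Level.ALL."""
    return level >= MIN_LEVEL.get(kind, Level.ALL)


print(is_corrected("boolean", Level.HIGH))     # True
print(is_corrected("timezone", Level.HIGH))    # False
print(is_corrected("timezone", Level.MEDIUM))  # True
print(is_corrected("boolean", Level.NONE))     # False
```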

Type differences can occur when:

ClickHouse → pandas: Different integer sizes (Int64 vs int64)

pandas → ClickHouse: Python objects to SQL types

NULL handling: pandas NA vs ClickHouse NULL

Boolean: Different boolean representations

DateTime: Timezone differences
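The NULL-handling case can be illustrated without either engine. pandas by default marks missing values in float columns with NaN, whereas a SQL NULL typically surfaces in Python as `None`, and the two behave differently in comparisons. The sketch below shows the kind of normalization a correction step might perform; `to_sql_null` and `to_nan` are hypothetical helpers written for this example, not DataStore functions.

```python
import math


def to_sql_null(value):
    """Map a float NaN (pandas-style missing marker) to None (NULL-style)."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def to_nan(value):
    """Map None (NULL-style) back to a float NaN."""
    return float("nan") if value is None else value


column = [1.5, float("nan"), 3.0]
normalized = [to_sql_null(v) for v in column]
print(normalized)  # [1.5, None, 3.0]

# NaN never compares equal to itself, so equality checks on uncorrected
# data can silently disagree between the two representations.
print(float("nan") == float("nan"))  # False
print(None == None)                  # True
```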

    from chdb.datastore.dtype_correction.config import CorrectionLevel
    from chdb.datastore.config import config

    # Strict mode - expect exact type matches
    config.set_correction_level(CorrectionLevel.NONE)

    # Relaxed mode - auto-fix type issues
    config.set_correction_level(CorrectionLevel.ALL)

    from chdb.datastore.config import function_config

    # Force engine for functions
    function_config.use_chdb(*function_names)
    function_config.use_pandas(*function_names)

    # Set default preference
    function_config.prefer_chdb()
    function_config.prefer_pandas()

    # Reset to default (auto)
    function_config.reset()

    # Check configuration
    function_config.get_engine('length')  # Returns 'chdb', 'pandas', or 'auto'

Some methods support per-call engine override:

    # Using engine parameter (where supported)
    ds['result'] = ds['col'].str.upper(engine='pandas')

    from chdb.datastore.config import config

    # Use auto mode, let DataStore decide
    config.use_auto()

    from chdb.datastore.config import function_config

    # For ClickHouse-optimized string processing
    function_config.use_chdb('length', 'substring', 'concat')

    # For pandas-compatible string behavior
    function_config.use_pandas('upper', 'lower')

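    from chdb.datastore.dtype_correction.config import CorrectionLevel
    from chdb.datastore.config import config

    # Development: more permissive
    config.set_correction_level(CorrectionLevel.ALL)

    # Production: stricter
    config.set_correction_level(CorrectionLevel.HIGH)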