Key Differences from pandas

While DataStore is highly compatible with pandas, there are important differences to understand.

| Aspect | pandas | DataStore |
|---|---|---|
| Execution | Eager (immediate) | Lazy (deferred) |
| Return types | DataFrame/Series | DataStore/ColumnExpr |
| Row order | Preserved | Preserved (automatic) |
| inplace | Supported | Not supported |
| Index | Full support | Simplified |
| Memory | All data in memory | Data at source |

In pandas, operations execute immediately:

    import pandas as pd

    df = pd.read_csv("data.csv")  # Loads entire file NOW
    result = df[df['age'] > 25]  # Filters NOW
    grouped = result.groupby('city')['salary'].mean()  # Aggregates NOW

In DataStore, operations are deferred until results are needed:

    from chdb import datastore as pd

    ds = pd.read_csv("data.csv")  # Just records the source
    result = ds[ds['age'] > 25]  # Just records the filter
    grouped = result.groupby('city')['salary'].mean()  # Just records the aggregation

    # Execution happens here:
    print(grouped)  # Executes when displaying
    df = grouped.to_df()  # Or when converting to pandas

Lazy execution enables:

- Query optimization: Multiple operations compile to one SQL query
- Column pruning: Only needed columns are read
- Filter pushdown: Filters apply at the source
- Memory efficiency: Data you don't need is never loaded
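The benefits above can be illustrated with a toy query builder. This is only a sketch under stated assumptions: `LazyTable` is a hypothetical class written for this example, not DataStore's implementation, and the standard-library `sqlite3` module stands in for chDB. It shows how recorded operations can compile into a single SQL statement in which the projection (column pruning) and the `WHERE` clause (filter pushdown) are handled by the engine.

```python
import sqlite3


class LazyTable:
    """Toy lazy table (hypothetical): records operations, runs one query on demand."""

    def __init__(self, conn, table, columns=None, filters=()):
        self._conn = conn
        self._table = table
        self._columns = columns          # None means "all columns"
        self._filters = tuple(filters)

    def filter(self, condition):
        # Record the filter; nothing is executed yet.
        # Conditions are raw SQL text here, which is fine for a toy only.
        return LazyTable(self._conn, self._table, self._columns,
                         self._filters + (condition,))

    def select(self, *columns):
        # Record the projection so only these columns are read.
        return LazyTable(self._conn, self._table, columns, self._filters)

    def to_sql(self):
        cols = ", ".join(self._columns) if self._columns else "*"
        sql = f"SELECT {cols} FROM {self._table}"
        if self._filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self._filters)
        return sql

    def to_rows(self):
        # Execution happens here, as a single query.
        return self._conn.execute(self.to_sql()).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ann", 31, "Oslo"), ("Bob", 22, "Rome"), ("Cy", 40, "Oslo")],
)

query = LazyTable(conn, "people").filter("age > 25").select("name", "city")
sql_text = query.to_sql()   # one statement covering both recorded operations
rows = query.to_rows()      # the engine only returns matching rows and columns
```

Building `query` touches no data; the `age` column is filtered on by the engine but never returned to Python.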

pandas returns concrete objects:

    df['col']  # Returns pd.Series
    df[['a', 'b']]  # Returns pd.DataFrame
    df[df['x'] > 10]  # Returns pd.DataFrame
    df.groupby('x')  # Returns DataFrameGroupBy

DataStore returns lazy objects:

    ds['col']  # Returns ColumnExpr (lazy)
    ds[['a', 'b']]  # Returns DataStore (lazy)
    ds[ds['x'] > 10]  # Returns DataStore (lazy)
    ds.groupby('x')  # Returns LazyGroupBy

To get pandas objects from a DataStore, convert explicitly or trigger execution:

    # Get pandas DataFrame
    df = ds.to_df()
    df = ds.to_pandas()

    # Get pandas Series from column
    series = ds['col'].to_pandas()

    # Or trigger execution
    print(ds)  # Automatically converts for display

DataStore executes when you need actual values:

| Trigger | Example | Notes |
|---|---|---|
| print() / repr() | print(ds) | Display needs data |
| len() | len(ds) | Need row count |
| .columns | ds.columns | Need column names |
| .dtypes | ds.dtypes | Need type info |
| .shape | ds.shape | Need dimensions |
| .values | ds.values | Need actual data |
| .index | ds.index | Need index |
| to_df() | ds.to_df() | Explicit conversion |
| Iteration | for row in ds | Need to iterate |
| equals() | ds.equals(other) | Need comparison |
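Several of these triggers map onto Python's special methods. The sketch below is a toy, assuming a hypothetical `LazyResult` class that is not part of chDB; it shows how `len()`, iteration, and `repr()` can each force a deferred computation. Caching the result after the first trigger is a design choice of this toy, and the source does not say whether DataStore does the same.

```python
class LazyResult:
    """Toy lazy container (hypothetical): defers work until a value is needed."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable producing the rows
        self._cache = None
        self.executions = 0       # counts how often the work really ran

    def _materialize(self):
        if self._cache is None:
            self.executions += 1
            self._cache = list(self._compute())
        return self._cache

    def __len__(self):            # len(obj) needs the row count
        return len(self._materialize())

    def __iter__(self):           # "for row in obj" needs the rows
        return iter(self._materialize())

    def __repr__(self):           # print(obj) needs data to display
        return f"LazyResult({self._materialize()!r})"


lazy = LazyResult(lambda: (n * n for n in range(4)))
before = lazy.executions          # nothing has run yet
size = len(lazy)                  # first trigger: runs the computation
text = repr(lazy)                 # second trigger: reuses the cached rows
after = lazy.executions
```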

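The following operations stay lazy and return without executing:

| Operation | Returns |
|---|---|
| filter() | DataStore |
| select() | DataStore |
| sort() | DataStore |
| groupby() | LazyGroupBy |
| join() | DataStore |
| ds['col'] | ColumnExpr |
| ds[['a', 'b']] | DataStore |
| ds[condition] | DataStore |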

In pandas, row order is always preserved:

    df = pd.read_csv("data.csv")
    print(df.head())  # Always same order as file

In DataStore, row order is automatically preserved for most operations:

    ds = pd.read_csv("data.csv")
    print(ds.head())  # Matches file order

    # Filter preserves order
    ds_filtered = ds[ds['age'] > 25]  # Same order as pandas

DataStore automatically tracks original row positions internally (using rowNumberInAllBlocks()) to ensure order consistency with pandas.
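The idea of tracking row positions can be sketched in plain Python. This is an analogy only, not how chDB implements it: each row is tagged with its original position, the rows are shuffled to simulate an engine that returns them in arbitrary order, and the tag is then used to restore file order after filtering.

```python
import random

rows = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 22},
    {"name": "Cy", "age": 40},
    {"name": "Dee", "age": 28},
]

# Tag each row with its original position (the role a row-number column plays).
tagged = list(enumerate(rows))

# Simulate an engine that hands rows back in an arbitrary order.
random.Random(0).shuffle(tagged)

# Filter, then restore the original order using the tracked position.
kept = [(position, row) for position, row in tagged if row["age"] > 25]
kept.sort(key=lambda pair: pair[0])
names = [row["name"] for _, row in kept]
```

Whatever order the shuffle produces, `names` comes out in the original order of the surviving rows.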

Order is preserved for:

- File sources (CSV, Parquet, JSON, etc.)
- pandas DataFrame sources
- Filter operations
- Column selection
- Results of an explicit sort() or sort_values()
- Operations that define order (nlargest(), nsmallest(), head(), tail())

Order may differ from pandas:

- After groupby() aggregations (use sort_values() to ensure consistent order)
- After merge() / join() with certain join types
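Why an explicit sort matters after aggregation can be shown without any library. In this stdlib-only sketch (the `mean_by_city` helper and the sample data are made up for illustration), the same groups come back in a different order depending on the order in which rows arrive; sorting the result, which is the role sort_values() plays, makes the two runs comparable.

```python
from collections import defaultdict

records = [("Rome", 50), ("Oslo", 70), ("Rome", 30), ("Lima", 40), ("Oslo", 90)]


def mean_by_city(rows):
    """Average salary per city; group order follows first appearance."""
    totals = defaultdict(lambda: [0, 0])   # city -> [sum, count]
    for city, salary in rows:
        totals[city][0] += salary
        totals[city][1] += 1
    return [(city, total / count) for city, (total, count) in totals.items()]


first = mean_by_city(records)
second = mean_by_city(list(reversed(records)))   # same data, different arrival order

same_without_sort = first == second                  # group order differs
same_with_sort = sorted(first) == sorted(second)     # explicit sort fixes it
```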

pandas supports in-place modification:

    df.drop(columns=['col'], inplace=True)  # Modifies df
    df.fillna(0, inplace=True)  # Modifies df
    df.rename(columns={'old': 'new'}, inplace=True)  # Modifies df

In DataStore, inplace=True is not supported. Always assign the result:

    ds = ds.drop(columns=['col'])  # Returns new DataStore
    ds = ds.fillna(0)  # Returns new DataStore
    ds = ds.rename(columns={'old': 'new'})  # Returns new DataStore

DataStore uses immutable operations to enable:

- Query building (lazy evaluation)
- Thread safety
- Easier debugging
- Cleaner code
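The immutable style can be sketched with a frozen dataclass. The `Query` class below is hypothetical and exists only for this example; it is not DataStore's internal representation. Each method returns a new object, so the original description stays valid and can be reused or shared safely.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Query:
    """Toy immutable query description (hypothetical, not DataStore itself)."""
    source: str
    dropped: tuple = ()
    fill_value: object = None

    def drop(self, columns):
        # Return a new Query; self is left untouched.
        return replace(self, dropped=self.dropped + tuple(columns))

    def fillna(self, value):
        return replace(self, fill_value=value)


base = Query("data.csv")
cleaned = base.drop(["col"]).fillna(0)   # base is unchanged by this chain
```

Because `Query` is frozen, an accidental `base.fill_value = 0` raises `dataclasses.FrozenInstanceError` instead of silently changing a shared object.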

pandas has full index support:

    df = df.set_index('id')
    df.loc['user123']  # Label-based access
    df.loc['a':'z']  # Label-based slicing
    df.reset_index()
    df.index.name = 'user_id'

DataStore has simplified index support:

    # Basic operations work
    ds.loc[0:10]  # Integer position
    ds.iloc[0:10]  # Same as loc for DataStore

    # For pandas-style index operations, convert first
    df = ds.to_df()
    df = df.set_index('id')
    df.loc['user123']

The index a DataStore exposes depends on its source:

- DataFrame source: Preserves pandas index
- File source: Uses simple integer index

pandas doesn't recognize DataStore objects:

    import pandas as pd
    from chdb import datastore as ds

    pdf = pd.DataFrame({'a': [1, 2, 3]})
    dsf = ds.DataFrame({'a': [1, 2, 3]})

    # This doesn't work as expected
    pdf == dsf  # pandas doesn't know DataStore

    # Solution: convert DataStore to pandas
    pdf.equals(dsf.to_pandas())  # True

    # DataStore.equals() also works
    dsf.equals(pdf)  # Compares with pandas DataFrame

pandas uses numpy/pandas types:

    df['col'].dtype  # int64, float64, object, datetime64, etc.

DataStore may use ClickHouse types:

    ds['col'].dtype  # Int64, Float64, String, DateTime, etc.

    # Types are converted when going to pandas
    df = ds.to_df()
    df['col'].dtype  # Now pandas type

    # Force specific type
    ds['col'] = ds['col'].astype('int64')

In pandas, all data lives in memory:

    import pandas as pd

    df = pd.read_csv("huge.csv")  # 10GB in memory!

In DataStore, data stays at the source until needed:

    from chdb import datastore as pd

    ds = pd.read_csv("huge.csv")  # Just metadata
    ds = ds.filter(ds['year'] == 2024)  # Still just metadata

    # Only filtered result is loaded
    df = ds.to_df()  # Maybe only 1GB now
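The memory effect of filtering at the source can be approximated with a generator. This stdlib-only sketch is an analogy, not a description of how chDB reads files: an in-memory `StringIO` stands in for a large CSV, and the hypothetical `rows_for_year` helper yields only matching rows, so rejected rows are never accumulated.

```python
import csv
import io

# Stand-in for a large file; in practice this would be an open file handle.
source = io.StringIO(
    "year,city,sales\n"
    "2023,Oslo,10\n"
    "2024,Rome,20\n"
    "2024,Oslo,30\n"
    "2022,Lima,40\n"
)


def rows_for_year(handle, year):
    """Yield only the rows for one year; other rows are read and discarded."""
    for row in csv.DictReader(handle):
        if int(row["year"]) == year:
            yield row


matching = list(rows_for_year(source, 2024))   # only the filtered rows are kept
total_sales = sum(int(row["sales"]) for row in matching)
```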

Errors come from different places:

- pandas errors: From the pandas library
- DataStore errors: From chDB or ClickHouse

    # May see ClickHouse-style errors
    # "Code: 62. DB::Exception: Syntax error..."

To debug a failing query:

    # View the SQL to debug
    print(ds.to_sql())

    # See execution plan
    ds.explain()

    # Enable debug logging
    from chdb.datastore.config import config
    config.enable_debug()
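One way to combine these tools is a small wrapper that prints the generated SQL whenever conversion fails. This is a sketch: `to_df_with_sql_on_error` is a hypothetical helper, it relies only on the `to_df()` and `to_sql()` methods shown above, and it assumes `to_sql()` can still be called after a failed execution. The exception type chDB raises is not stated here, so the sketch catches `Exception` and re-raises it unchanged.

```python
def to_df_with_sql_on_error(ds):
    """Convert to pandas; on failure, print the generated SQL and re-raise."""
    try:
        return ds.to_df()
    except Exception:
        # Assumption: to_sql() only builds query text, so it is still
        # callable even though execution has just failed.
        print("Query that failed:")
        print(ds.to_sql())
        raise
```

Calling `df = to_df_with_sql_on_error(ds)` behaves like `ds.to_df()` on success and adds the SQL text to the output on failure.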

When migrating from pandas:

- Change import statement
- Remove inplace=True parameters
- Add explicit to_df() where pandas DataFrame is required
- Add sorting if row order matters
- Use to_pandas() for comparison tests
- Test with representative data sizes
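During a gradual migration, code often receives either a pandas object or a DataStore. A small duck-typed helper can normalize both; `as_pandas` is a hypothetical name for this sketch, and it relies only on the `to_pandas()` method shown earlier, which pandas objects do not have.

```python
def as_pandas(obj):
    """Return obj converted via to_pandas() if it has one, else obj unchanged."""
    convert = getattr(obj, "to_pandas", None)
    return convert() if callable(convert) else obj
```

In a comparison test this might be used as `expected.equals(as_pandas(result))`, so the same assertion works before and after the import statement is changed.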