Migration from pandas

This guide helps you migrate existing pandas code to DataStore for better performance while maintaining compatibility.

The simplest migration is changing your import:

    # Before (pandas)
    import pandas as pd

    # After (DataStore)
    from chdb import datastore as pd

That's it! Most pandas code works unchanged.

1. Install chDB

    pip install "chdb>=4.0"

2. Change the Import

    # Change this:
    import pandas as pd

    # To this:
    from chdb import datastore as pd

3. Test Your Code

Run your existing code. Most operations work unchanged:

    from chdb import datastore as pd

    # These all work the same
    df = pd.read_csv("data.csv")
    result = df[df['age'] > 25]
    grouped = df.groupby('city')['salary'].mean()
    df.to_csv("output.csv")

4. Handle Any Differences

A few operations behave differently. See Key Differences below.
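During a gradual rollout it can help to keep the import swap in one place, so that a script still runs on machines where chDB is not installed yet. The following is a minimal sketch of that idea, not part of chDB: `available` and `load_dataframe_backend` are hypothetical helpers, and the only chDB-specific assumption is the `chdb.datastore` module path used by the import shown in this guide.

```python
import importlib
import importlib.util


def available(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def load_dataframe_backend(prefer_datastore=True):
    """Return (name, module) for the first importable backend, or (None, None)."""
    order = ["chdb", "pandas"] if prefer_datastore else ["pandas", "chdb"]
    # "chdb.datastore" mirrors `from chdb import datastore as pd` from the guide.
    targets = {"chdb": "chdb.datastore", "pandas": "pandas"}
    for name in order:
        if not available(name):
            continue
        try:
            return name, importlib.import_module(targets[name])
        except ImportError:
            # For example, an older chDB without the datastore module.
            continue
    return None, None


backend_name, pd = load_dataframe_backend()
print("DataFrame backend:", backend_name)
```

Passing `prefer_datastore=False` flips the preference, which gives a quick way to fall back to pandas while investigating a difference.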

    # All these work the same
    df = pd.read_csv("data.csv")
    df = pd.read_parquet("data.parquet")
    df = pd.read_json("data.json")
    df = pd.read_excel("data.xlsx")

    # Boolean indexing
    df[df['age'] > 25]
    df[(df['age'] > 25) & (df['city'] == 'NYC')]

    # query() method
    df.query('age > 25 and salary > 50000')

    # Column selection
    df['name']
    df[['name', 'age']]

    # Row selection
    df.head(10)
    df.tail(10)
    df.iloc[0:100]

    # GroupBy
    df.groupby('city')['salary'].mean()
    df.groupby(['city', 'dept']).agg({'salary': ['sum', 'mean']})

    df.sort_values('salary', ascending=False)
    df.sort_values(['city', 'age'])

    df['name'].str.upper()
    df['name'].str.contains('John')
    df['name'].str.len()

    df['date'].dt.year
    df['date'].dt.month
    df['date'].dt.dayofweek

    df.to_csv("output.csv")
    df.to_parquet("output.parquet")
    df.to_json("output.json")

DataStore operations are lazy - they don't execute until results are needed.

pandas:

    # Executes immediately
    result = df[df['age'] > 25]
    print(type(result))  # pandas.DataFrame

DataStore:

    # Builds query, doesn't execute yet
    result = ds[ds['age'] > 25]
    print(type(result))  # DataStore (lazy)

    # Executes when you need the data
    print(result)        # Triggers execution
    df = result.to_df()  # Triggers execution

Operation         pandas Returns   DataStore Returns
df['col']         Series           ColumnExpr (lazy)
df[['a', 'b']]    DataFrame        DataStore (lazy)
df[condition]     DataFrame        DataStore (lazy)
df.groupby('x')   GroupBy          LazyGroupBy
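To make the lazy model concrete, here is a toy sketch in plain Python. `LazyRows` is a hypothetical class invented for this illustration and says nothing about how DataStore is implemented; it only mimics the idea that filters are recorded and nothing runs until the data is requested.

```python
class LazyRows:
    """Toy stand-in for a lazy table: records filters, applies them on demand."""

    def __init__(self, rows, predicates=()):
        self._rows = rows
        self._predicates = tuple(predicates)
        self.executions = 0  # how many times this object was materialized

    def filter(self, predicate):
        # Returns a new lazy object; no row is inspected at this point.
        return LazyRows(self._rows, self._predicates + (predicate,))

    def to_list(self):
        # Materialization point, playing the role of to_df() in the guide.
        self.executions += 1
        return [r for r in self._rows if all(p(r) for p in self._predicates)]


people = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 22},
    {"name": "Cy", "age": 40},
]

# Chaining two filters only builds up the plan.
lazy = LazyRows(people).filter(lambda r: r["age"] > 25).filter(lambda r: r["name"] != "Cy")
executions_before = lazy.executions

# The rows are scanned once, with both filters applied in the same pass.
result = lazy.to_list()
print(result)
```

The practical consequence for migrated code is the same as in the toy: chaining operations is cheap, and the cost is paid at the point where results are printed or converted.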

DataStore doesn't support inplace=True. Always use the return value:

pandas:

    df.drop(columns=['col'], inplace=True)

DataStore:

    ds = ds.drop(columns=['col'])  # Assign the result
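In a larger codebase, finding every `inplace=True` call by hand is tedious. The sketch below uses Python's standard `ast` module to list them; `find_inplace_calls` is a hypothetical helper written for this guide, and it only detects the literal keyword `inplace=True`, not values passed through variables.

```python
import ast


def find_inplace_calls(source):
    """Return sorted (line, method_name) pairs for calls passing inplace=True."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            is_true = isinstance(kw.value, ast.Constant) and kw.value.value is True
            if kw.arg == "inplace" and is_true:
                # For `df.drop(...)` the function node is an Attribute named "drop".
                name = node.func.attr if isinstance(node.func, ast.Attribute) else "<call>"
                hits.append((node.lineno, name))
    return sorted(hits)


# Line 1 of this sample is empty because the string starts with a newline.
sample = '''
df.drop(columns=["col"], inplace=True)
df = df.rename(columns={"a": "b"})
df.fillna(0, inplace=True)
'''

hits = find_inplace_calls(sample)
for line, method in hits:
    print(f"line {line}: {method}(..., inplace=True) -> assign the return value instead")
```

To scan a real script, read the file's text and pass it to `find_inplace_calls` in place of `sample`.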

pandas doesn't recognize DataStore objects, so use to_pandas() for comparison:

    # This may not work as expected
    df == ds  # pandas doesn't know DataStore

    # Do this instead
    df.equals(ds.to_pandas())

As with SQL databases, DataStore may not preserve row order when reading from file sources. Use explicit sorting when the order matters:

    # pandas preserves order
    df = pd.read_csv("data.csv")

    # DataStore - use sort for guaranteed order
    ds = pd.read_csv("data.csv")
    ds = ds.sort('id')  # Explicit ordering

    # pandas
    import pandas as pd
    df = pd.read_csv("data.csv")
    result = df[df['amount'] > 100].groupby('category')['amount'].sum()
    result.to_csv("output.csv")

    # DataStore - same code works!
    from chdb import datastore as pd
    df = pd.read_csv("data.csv")
    result = df[df['amount'] > 100].groupby('category')['amount'].sum()
    result.to_csv("output.csv")

If you need pandas-specific features, convert at the end:

    from chdb import datastore as pd

    # Fast DataStore operations
    ds = pd.read_csv("large_data.csv")
    ds = ds.filter(ds['date'] >= '2024-01-01')
    ds = ds.filter(ds['amount'] > 100)

    # Convert to pandas for specific features
    df = ds.to_df()
    df_pivoted = df.pivot_table(...)  # pandas-specific

    from chdb import datastore as pd
    import pandas

    # Start with DataStore for fast filtering
    ds = pd.read_csv("huge_file.csv")        # 10M rows
    ds = ds.filter(ds['year'] == 2024)       # Fast SQL filter
    ds = ds.select('col1', 'col2', 'col3')   # Column pruning

    # Convert for pandas-specific operations
    df = ds.to_df()                          # Now only ~100K rows
    result = df.apply(complex_custom_function)  # pandas

DataStore is significantly faster for large datasets:

Operation          pandas    DataStore   Speedup
GroupBy count      347ms     17ms        19.93x
Complex pipeline   2,047ms   380ms       5.39x
Filter+Sort+Head   1,537ms   350ms       4.40x
GroupBy agg        406ms     141ms       2.88x

Benchmark on 10M rows
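The speedup figures are the pandas time divided by the DataStore time. To measure your own workload the same way, a minimal timing sketch could look like the following. `best_time` and `speedup` are hypothetical helpers, and the two workloads are placeholders standing in for a pandas pipeline and its DataStore equivalent.

```python
import time


def best_time(fn, repeats=5):
    """Run fn several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Placeholder workloads; swap in your own pandas and DataStore pipelines.
# For DataStore, include the call that triggers execution (such as to_df()),
# otherwise only query construction is timed.
def baseline_workload():
    return sum([i * i for i in range(200_000)])


def candidate_workload():
    return sum(i * i for i in range(200_000))


ratio = speedup(best_time(baseline_workload), best_time(candidate_workload))
print(f"speedup: {ratio:.2f}x")

# Reproducing one row of the benchmark table: 2,047ms vs 380ms.
print(f"Complex pipeline: {speedup(2.047, 0.380):.2f}x")
```

Taking the best of several runs reduces noise from caching and other processes; it does not replace benchmarking on data of realistic size.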

Some pandas operations may not be supported. Check:

Is the operation in the compatibility list?

If it is not, try converting to pandas first: ds.to_df().operation()

Enable debug logging to understand what's happening:

    from chdb.datastore.config import config

    config.enable_debug()

    # View the SQL being generated
    ds.filter(ds['x'] > 10).explain()

Check your execution pattern:

    # Bad: Multiple small executions
    for i in range(1000):
        result = ds.filter(ds['id'] == i).to_df()

    # Good: Single execution
    ids = list(range(1000))
    result = ds.filter(ds['id'].isin(ids)).to_df()

DataStore may infer types differently:

    # Check types
    print(ds.dtypes)

    # Force conversion
    ds['col'] = ds['col'].astype('int64')
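When comparing a migrated script against the original, it can be easier to look only at the columns whose types changed. The sketch below works on plain `{column: dtype_name}` dictionaries; `dtype_mismatches` is a hypothetical helper, and the two dictionaries are made-up values for illustration, not real DataStore output.

```python
def dtype_mismatches(expected, actual):
    """Return {column: (expected_dtype, actual_dtype)} for columns that differ.

    A column missing on one side is reported with None for that side.
    """
    diffs = {}
    for col in sorted(set(expected) | set(actual)):
        e, a = expected.get(col), actual.get(col)
        if e != a:
            diffs[col] = (e, a)
    return diffs


# Made-up example values. With pandas, such a dictionary can be built with
# {col: str(dtype) for col, dtype in df.dtypes.items()}.
pandas_types = {"id": "int64", "price": "float64", "day": "object"}
datastore_types = {"id": "int64", "price": "float64", "day": "datetime64[ns]"}

mismatches = dtype_mismatches(pandas_types, datastore_types)
for col, (before, after) in mismatches.items():
    print(f"{col}: {before} -> {after}")
```

Each reported column is a candidate for an explicit `astype` call like the one above.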

    # Keep both imports
    import pandas as pd
    from chdb import datastore as ds

    # Compare results
    pdf = pd.read_csv("data.csv")
    dsf = ds.read_csv("data.csv")

    # Verify they match
    assert pdf.equals(dsf.to_pandas())
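Because row order is not guaranteed for file sources, a strict equality check can fail even when both results contain the same rows. One hedged alternative is to compare the rows as a multiset. The sketch below does this for lists of row dictionaries; `same_rows` is a hypothetical helper, and it assumes hashable cell values with no NaN (NaN never compares equal to itself).

```python
from collections import Counter


def same_rows(left, right):
    """Compare two lists of row dicts as multisets, ignoring row and key order."""

    def freeze(row):
        # Sort by column name so key order inside a row does not matter.
        return tuple(sorted(row.items()))

    return Counter(map(freeze, left)) == Counter(map(freeze, right))


rows_a = [{"id": 1, "city": "NYC"}, {"id": 2, "city": "LA"}]
rows_b = [{"city": "LA", "id": 2}, {"id": 1, "city": "NYC"}]  # same rows, shuffled
rows_c = [{"id": 1, "city": "NYC"}, {"id": 2, "city": "SF"}]  # one value differs

print(same_rows(rows_a, rows_b))
print(same_rows(rows_a, rows_c))
```

With pandas frames, the inputs can be produced with `to_dict("records")`, for example `same_rows(pdf.to_dict("records"), dsf.to_pandas().to_dict("records"))`.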

Start with scripts that:

Read large files

Do filtering and aggregation

Don't use custom apply functions

For scripts with custom functions:

    from chdb import datastore as pd

    # Let DataStore handle the heavy lifting
    ds = pd.read_csv("data.csv")
    ds = ds.filter(ds['year'] == 2024)  # SQL

    # Convert for custom work
    df = ds.to_df()
    result = df.apply(my_custom_function)

Finally, switch all remaining scripts to the DataStore import.

DataStore and pandas can be used together. Convert between them freely:

    from chdb import datastore as ds
    import pandas as pd

    # DataStore to pandas
    df = ds_result.to_pandas()

    # pandas to DataStore (use a new name so the `ds` module alias is not overwritten)
    ds_frame = ds.DataFrame(pd_result)

Most existing tests should pass unchanged. For tests that compare against an expected frame, convert the result to pandas first:

    import pandas as pd  # real pandas, for the expected frame and the assertion

    def test_my_function():
        result = my_function()
        expected = pd.DataFrame(...)
        pd.testing.assert_frame_equal(result.to_pandas(), expected)

DataStore also works in Jupyter notebooks:

    from chdb import datastore as pd

    ds = pd.read_csv("data.csv")
    ds.head()  # Displays nicely in Jupyter

If you find compatibility issues, report them at: https://github.com/chdb-io/chdb/issues