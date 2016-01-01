Key Differences from pandas
While DataStore is highly compatible with pandas, there are important differences to understand.
Summary Table
| Aspect | pandas | DataStore |
| --- | --- | --- |
| Execution | Eager (immediate) | Lazy (deferred) |
| Return types | DataFrame/Series | DataStore/ColumnExpr |
| Row order | Preserved | Preserved (automatic) |
| inplace | Supported | Not supported |
| Index | Full support | Simplified |
| Memory | All data in memory | Data at source |
1. Lazy vs Eager Execution
pandas (Eager)
Operations execute immediately:
DataStore (Lazy)
Operations are deferred until results are needed:
Why It Matters
Lazy execution enables:
- Query optimization: Multiple operations compile to one SQL query
- Column pruning: Only needed columns are read
- Filter pushdown: Filters apply at the source
- Memory efficiency: Don't load data you don't need
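To make the benefits listed above concrete, here is a minimal sketch of how a lazy API can record operations and compile them into one SQL statement. It is a toy model written for illustration, not DataStore's implementation: the `LazyQuery` class, its `to_sql()` method, and the `events` table name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LazyQuery:
    """Toy lazy query (hypothetical): records operations, runs nothing."""
    source: str
    columns: Tuple[str, ...] = ()
    filters: Tuple[str, ...] = ()

    def select(self, *cols: str) -> "LazyQuery":
        # Column pruning: remember only the columns that are needed.
        return LazyQuery(self.source, tuple(cols), self.filters)

    def filter(self, condition: str) -> "LazyQuery":
        # Filter pushdown: the condition becomes part of the source query.
        return LazyQuery(self.source, self.columns, self.filters + (condition,))

    def to_sql(self) -> str:
        # All recorded steps compile into a single SQL statement.
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM {self.source}"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self.filters)
        return sql


# Three chained operations, still no data touched.
query = (
    LazyQuery("events")
    .filter("age > 25")
    .filter("country = 'DE'")
    .select("name", "age")
)
compiled_sql = query.to_sql()
print(compiled_sql)
# SELECT name, age FROM events WHERE (age > 25) AND (country = 'DE')
```

Because nothing runs until the end, the engine sees the whole pipeline at once and can read only the `name` and `age` columns for rows that pass both filters.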
2. Return Types
pandas
DataStore
Converting to pandas Types
3. Execution Triggers
DataStore executes when you need actual values:
| Trigger | Example | Notes |
| --- | --- | --- |
| print() / repr() | print(ds) | Display needs data |
| len() | len(ds) | Need row count |
| .columns | ds.columns | Need column names |
| .dtypes | ds.dtypes | Need type info |
| .shape | ds.shape | Need dimensions |
| .values | ds.values | Need actual data |
| .index | ds.index | Need index |
| to_df() | ds.to_df() | Explicit conversion |
| Iteration | for row in ds | Need to iterate |
| equals() | ds.equals(other) | Need comparison |
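The trigger behavior can be pictured with a small stand-in object. The sketch below is an assumption-laden toy, not DataStore code: `LazyRows` and `load_rows` are hypothetical names, and the toy re-runs its loader on every trigger, which says nothing about whether DataStore caches results.

```python
load_calls = []  # one entry is appended each time the source is actually read


def load_rows():
    """Hypothetical data source; records every real read."""
    load_calls.append(1)
    return [{"id": 1, "age": 31}, {"id": 2, "age": 19}, {"id": 3, "age": 47}]


class LazyRows:
    """Toy lazy container: work happens only when values are needed."""

    def __init__(self, loader, predicates=()):
        self._loader = loader
        self._predicates = tuple(predicates)

    def filter(self, predicate):
        # Stays lazy: only records the predicate and returns a new object.
        return LazyRows(self._loader, self._predicates + (predicate,))

    def _execute(self):
        rows = self._loader()
        return [r for r in rows if all(p(r) for p in self._predicates)]

    def __len__(self):
        # Trigger: a row count needs real data.
        return len(self._execute())

    def __iter__(self):
        # Trigger: iteration needs real data.
        return iter(self._execute())

    def __repr__(self):
        # Trigger: display needs real data.
        return f"LazyRows({self._execute()!r})"


adults = LazyRows(load_rows).filter(lambda r: r["age"] >= 21)
calls_after_building = len(load_calls)   # building the pipeline read nothing

row_count = len(adults)                  # first trigger
calls_after_len = len(load_calls)

ids = [r["id"] for r in adults]          # second trigger
calls_after_iteration = len(load_calls)
```

In this toy, building the filtered object performs zero reads, while `len()` and iteration each cause one.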
Operations That Stay Lazy
| Operation | Returns |
| --- | --- |
| filter() | DataStore |
| select() | DataStore |
| sort() | DataStore |
| groupby() | LazyGroupBy |
| join() | DataStore |
| ds['col'] | ColumnExpr |
| ds[['a', 'b']] | DataStore |
| ds[condition] | DataStore |
4. Row Order
pandas
Row order is always preserved:
DataStore
Row order is automatically preserved for most operations:
DataStore automatically tracks original row positions internally (using rowNumberInAllBlocks()) to ensure order consistency with pandas.
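The idea of tracking original positions can be sketched in plain Python. This is only an illustration of the technique under simple assumptions: `enumerate` stands in for the role the text assigns to rowNumberInAllBlocks(), a shuffle stands in for an engine that returns rows in arbitrary order, and the sample names are made up.

```python
import random

rows = ["alice", "bob", "carol", "dave", "erin"]

# Tag every row with its original position before any processing.
tagged = list(enumerate(rows))

# Simulate an engine that hands rows back in arbitrary order.
rng = random.Random(0)
rng.shuffle(tagged)

# Filter the unordered rows, then restore the original order by position.
kept = [(pos, name) for pos, name in tagged if name != "carol"]
restored = [name for pos, name in sorted(kept)]

print(restored)
# ['alice', 'bob', 'dave', 'erin']
```

Sorting on the carried position tag recovers the source order no matter how the intermediate step reordered the rows.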
When Order Is Preserved
- File sources (CSV, Parquet, JSON, etc.)
- pandas DataFrame sources
- Filter operations
- Column selection
- After explicit sort() or sort_values()
- Operations that define order (nlargest(), nsmallest(), head(), tail())
When Order May Differ
- After groupby() aggregations (use sort_values() to ensure consistent order)
- After merge() / join() with certain join types
5. No inplace Parameter
pandas
DataStore
inplace=True is not supported. Always assign the result:
Why No inplace?
DataStore uses immutable operations to enable:
- Query building (lazy evaluation)
- Thread safety
- Easier debugging
- Cleaner code
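A short sketch may help show why immutable operations suit query building. The `Pipeline` class below is hypothetical and only mimics the pattern of returning a new object from every call; it does not reflect DataStore internals.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Pipeline:
    """Toy immutable pipeline (hypothetical): each call returns a new object."""
    steps: Tuple[str, ...] = ()

    def drop(self, column: str) -> "Pipeline":
        return replace(self, steps=self.steps + (f"drop({column})",))

    def sort_values(self, column: str) -> "Pipeline":
        return replace(self, steps=self.steps + (f"sort_values({column})",))


base = Pipeline().drop("tmp")

# Two branches share the same base without affecting each other.
by_age = base.sort_values("age")
by_name = base.sort_values("name")

# In-place mutation is rejected, so the result must always be assigned.
try:
    base.steps = ()
    mutation_allowed = True
except FrozenInstanceError:
    mutation_allowed = False
```

Since `base` never changes, it can be reused safely from several branches or threads, and each variable always describes exactly one query.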
6. Index Support
pandas
Full index support:
DataStore
Simplified index support:
DataStore Source Matters
- DataFrame source: Preserves pandas index
- File source: Uses simple integer index
7. Comparison Behavior
Comparing with pandas
pandas doesn't recognize DataStore objects:
Using equals()
8. Type Inference
pandas
Uses numpy/pandas types:
DataStore
May use ClickHouse types:
Explicit Casting
9. Memory Model
pandas
All data lives in memory:
DataStore
Data stays at source until needed:
10. Error Messages
Different Error Sources
- pandas errors: From pandas library
- DataStore errors: From chDB or ClickHouse
Debugging Tips
Migration Checklist
When migrating from pandas:
- Change import statement
- Remove inplace=True parameters
- Add explicit to_df() where a pandas DataFrame is required
- Add sorting if row order matters
- Use to_pandas() for comparison tests
- Test with representative data sizes
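During a migration, code often has to accept both pandas objects and lazy results. One possible approach is a small duck-typed helper that materializes lazy objects through the to_df() or to_pandas() methods named in the checklist and passes everything else through. The helper name `to_pandas_if_lazy` and the `FakeLazyResult` stand-in are hypothetical, introduced only for this sketch.

```python
def to_pandas_if_lazy(obj):
    """Materialize DataStore-like objects; return anything else unchanged.

    Assumption: a lazy result exposes to_df() or to_pandas(), as the
    checklist above describes. Objects without either are passed through.
    """
    for method_name in ("to_df", "to_pandas"):
        method = getattr(obj, method_name, None)
        if callable(method):
            return method()
    return obj


class FakeLazyResult:
    """Stand-in for a lazy result, used only to exercise the helper."""

    def __init__(self, rows):
        self._rows = rows

    def to_df(self):
        # A real lazy object would execute its query here.
        return list(self._rows)


materialized = to_pandas_if_lazy(FakeLazyResult([1, 2, 3]))
passed_through = to_pandas_if_lazy([4, 5])
```

Calling the helper at the boundary where a pandas DataFrame is required keeps the rest of the pipeline lazy.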
Quick Reference
| pandas | DataStore |
| --- | --- |
| df[condition] | Same (returns DataStore) |
| df.groupby() | Same (returns LazyGroupBy) |
| df.drop(inplace=True) | ds = ds.drop() |
| df.equals(other) | ds.to_pandas().equals(other) |
| df.loc['label'] | ds.to_df().loc['label'] |
| print(df) | Same (triggers execution) |
| len(df) | Same (triggers execution) |