To read an NPY file in Python, use chDB to load the array into a DataFrame and work with it using the pandas API you already know:
```python
from chdb.datastore import DataStore

df = DataStore.from_file("data/readings.npy", format="Npy")
```

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

```shell
pip install chdb
```

Read a .npy file into a DataFrame #
DataStore.from_file reads the array into a lazy, ClickHouse-backed object. Point it at the file with format="Npy" and print the result:
```python
from chdb.datastore import DataStore

df = DataStore.from_file("data/readings.npy", format="Npy")
print(df)
print(df.dtypes)
```

```text
   array
0  50.00
1  63.09
2  74.73
3  83.66
4  88.88
5  89.82
6  86.37
7  78.92
array    float64
dtype: object
```

One thing is worth knowing about the shape: a .npy file stores exactly one numeric array and carries no column names, so chDB exposes a single column called array. The dtype is inferred from the array's element type: float64 here because the file holds 64-bit floats.
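To see why there is only one column to expose, it can help to look at what a .npy file records about itself. In NPY format version 1.0 the file starts with a magic string, a two-byte version, a header length, and a header that is a Python dict literal with three keys (descr, fortran_order, shape) and nothing resembling a column name. The sketch below uses only the standard library and builds the bytes in memory instead of reading data/readings.npy. The helpers build_npy_v1 and read_npy_header are hypothetical names written for this illustration; they are not part of chDB or NumPy, and this is not a description of how chDB parses the file internally.

```python
import ast
import io
import struct

MAGIC = b"\x93NUMPY"


def build_npy_v1(values):
    """Serialize a flat list of floats as version 1.0 .npy bytes (hypothetical helper)."""
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d,), }" % len(values)
    # Pad with spaces and end with a newline so that magic + version +
    # length field + header together end on a 64-byte boundary.
    prefix_len = len(MAGIC) + 2 + 2
    pad = -(prefix_len + len(header) + 1) % 64
    header_bytes = (header + " " * pad + "\n").encode("latin1")
    payload = struct.pack("<%dd" % len(values), *values)  # little-endian float64
    return (
        MAGIC
        + bytes([1, 0])
        + struct.pack("<H", len(header_bytes))
        + header_bytes
        + payload
    )


def read_npy_header(stream):
    """Return ((major, minor), header_dict) for a version 1.x .npy stream."""
    if stream.read(6) != MAGIC:
        raise ValueError("not an NPY file")
    major, minor = stream.read(2)
    if major != 1:
        raise ValueError("this sketch only handles version 1.x headers")
    (header_len,) = struct.unpack("<H", stream.read(2))
    header = ast.literal_eval(stream.read(header_len).decode("latin1"))
    return (major, minor), header


# The eight sample readings shown above.
readings = [50.00, 63.09, 74.73, 83.66, 88.88, 89.82, 86.37, 78.92]
version, header = read_npy_header(io.BytesIO(build_npy_v1(readings)))
print(version)
print(header)
```

The descr value '<f8' means little-endian 64-bit float, which lines up with the float64 dtype reported above, and shape (8,) is the only structural information the file carries.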
Like the CSV and JSON readers, df is not the whole array loaded into memory. It is a lazy object that runs on ClickHouse's engine when you materialize a result by printing, calling len(), or invoking .to_pandas().
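The idea of a lazy object can be sketched in a few lines of plain Python. This is a toy model only: LazyColumn, its filter and materialize methods, and load_readings are hypothetical names invented for this illustration, and nothing here reflects how chDB is implemented. It simply shows the general pattern of recording operations and running them when a result is actually requested, here by calling len().

```python
class LazyColumn:
    """Toy stand-in for a lazy column: records steps, runs them on demand."""

    def __init__(self, load, steps=()):
        self._load = load            # zero-argument callable that produces the values
        self._steps = tuple(steps)   # deferred filter predicates, applied in order

    def filter(self, predicate):
        # Return a new lazy object; no data is read at this point.
        return LazyColumn(self._load, self._steps + (predicate,))

    def materialize(self):
        # Only here is the source read and the recorded steps applied.
        values = self._load()
        for predicate in self._steps:
            values = [v for v in values if predicate(v)]
        return values

    def __len__(self):
        return len(self.materialize())


load_calls = []


def load_readings():
    """Hypothetical loader; records each call so we can see when reading happens."""
    load_calls.append("load")
    return [50.00, 63.09, 74.73, 83.66, 88.88, 89.82, 86.37, 78.92]


above_75 = LazyColumn(load_readings).filter(lambda v: v > 75)
loads_before = len(load_calls)   # building the pipeline has not read anything yet
count = len(above_75)            # asking for a length forces the work
loads_after = len(load_calls)
print(loads_before, count, loads_after)
```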
Filter and aggregate the way you already do #
Use the pandas API you already know to aggregate the column:
```python
from chdb.datastore import DataStore

df = DataStore.from_file("data/readings.npy", format="Npy")
mean_val = df["array"].mean()
print(f"mean: {float(mean_val):.2f}")

max_val = df["array"].max()
min_val = df["array"].min()
print(f"max: {float(max_val):.2f} min: {float(min_val):.2f}")
```

```text
mean: 76.93
max: 89.82 min: 50.00
```

The same approach works for boolean indexing after you materialize to real pandas with .to_pandas():
```python
pdf = df.to_pandas()
good = pdf[pdf["array"] > 75]
print(f"readings above 75: {len(good)}")
```

```text
readings above 75: 5
```

A 2-D .npy reads as one row per outer element #
A .npy can hold a matrix as well as a flat array. Save one with np.save("matrix.npy", np.arange(12).reshape(4, 3)) and chDB reads each outer element as a row in the array column:
```python
mat = DataStore.from_file("data/matrix.npy", format="Npy")
print(mat)
```

```text
         array
0    [0, 1, 2]
1    [3, 4, 5]
2    [6, 7, 8]
3  [9, 10, 11]
```

Each element in the array column is a list. Call .to_pandas() to hand it off to pandas or NumPy for further processing.
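The row grouping follows from how the matrix is laid out. NumPy's default C order stores a matrix row by row, so np.arange(12).reshape(4, 3) is twelve consecutive values and each run of three is one outer element. The following is a minimal pure-Python sketch of that grouping; split_rows is a hypothetical helper for illustration, not a chDB or NumPy function, and it assumes a 2-D, row-major array.

```python
def split_rows(flat, shape):
    """Group a flat, row-major sequence into one list per outer element."""
    n_rows, n_cols = shape
    if len(flat) != n_rows * n_cols:
        raise ValueError("shape does not match the number of values")
    return [list(flat[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]


flat = list(range(12))            # the values 0..11 in storage order
rows = split_rows(flat, (4, 3))   # shape taken from the reshape(4, 3) call
for index, row in enumerate(rows):
    print(index, row)
```

Each printed row corresponds to one entry of the array column in the table above.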
Join two arrays by position #
NumPy workflows often keep parallel arrays in separate files: readings in one .npy, a quality flag in another, aligned by position. Load each file, hand both off to real pandas, and join on the index:
```python
readings_pdf = (
    DataStore.from_file("data/readings.npy", format="Npy")
    .to_pandas()
    .rename(columns={"array": "reading"})
)
flags_pdf = (
    DataStore.from_file("data/flags.npy", format="Npy")
    .to_pandas()
    .rename(columns={"array": "ok"})
)
combined = readings_pdf.join(flags_pdf)
print(combined)
```

```text
   reading  ok
0    50.00   0
1    63.09   1
2    74.73   1
3    83.66   1
4    88.88   0
5    89.82   1
6    86.37   1
7    78.92   1
```

From there, filter to only the rows where ok == 1 and compute the mean in pandas:
```python
mean_ok = combined[combined["ok"] == 1]["reading"].mean()
print(f"mean of flagged readings: {mean_ok:.2f}")
```

Hand off to real pandas when you need it #
When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:
```python
df = DataStore.from_file("data/readings.npy", format="Npy")
pdf = df.to_pandas()  # a real pandas.DataFrame, in memory
print(type(pdf))
```

```text
<class 'pandas.DataFrame'>
```

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast on large files, then call .to_pandas() on the small result and continue in ordinary pandas.
Works in Jupyter #
Printing a chDB DataStore object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.
Run it yourself #
The complete, runnable example is here, with generate.sh to create the sample .npy files plus run.ipynb and a run.py mirror containing the exact code above:
github.com/ClickHouse/examples/tree/main/local-analytics/chdb-npy
Working with other formats? The guides "Read a JSON file in Python" and "Read a Parquet file in Python" use the same chDB DataStore pattern; only the reader changes. The same code scales from a file on your laptop to ClickHouse Cloud with no rewrite.