How to read an NPY file in Python

To read an NPY file in Python, use chDB to load the array into a DataFrame and work with it using the pandas API you already know:

from chdb.datastore import DataStore
df = DataStore.from_file("data/readings.npy", format="Npy")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

Install chDB with pip:

pip install chdb

Read a .npy file into a DataFrame

DataStore.from_file reads the array into a lazy, ClickHouse-backed object. Point it at the file with format="Npy" and print the result:

from chdb.datastore import DataStore

df = DataStore.from_file("data/readings.npy", format="Npy")
print(df)
print(df.dtypes)

   array
0  50.00
1  63.09
2  74.73
3  83.66
4  88.88
5  89.82
6  86.37
7  78.92
array    float64
dtype: object

One thing is worth knowing about the shape: a .npy file stores exactly one array and carries no column names, so chDB exposes a single column called array. The dtype is inferred from the array's element type: float64 here because the file holds 64-bit floats.
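To see why there are no column names to recover, it helps to look at the file header directly. The following is a minimal sketch using only the standard library, assuming NPY format version 1.0 as described in NumPy's format specification. The helpers write_npy_v1 and read_npy_header are hypothetical names made up for this illustration; they are not part of chDB or NumPy.

```python
import ast
import struct
import tempfile
from pathlib import Path

MAGIC = b"\x93NUMPY"


def write_npy_v1(path, values):
    """Write a 1-D float64 array as a version 1.0 .npy file (stdlib only)."""
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d,), }" % len(values)
    # Pad with spaces and end with a newline so the data starts on a 64-byte boundary.
    prefix_len = len(MAGIC) + 2 + 2  # magic + version bytes + header-length field
    pad = (64 - (prefix_len + len(header) + 1) % 64) % 64
    header_bytes = (header + " " * pad + "\n").encode("latin1")
    with open(path, "wb") as f:
        f.write(MAGIC + bytes([1, 0]))                 # magic string, version 1.0
        f.write(struct.pack("<H", len(header_bytes)))  # little-endian header length
        f.write(header_bytes)
        f.write(struct.pack("<%dd" % len(values), *values))  # raw float64 data


def read_npy_header(path):
    """Return the header dict of a version 1.x .npy file."""
    with open(path, "rb") as f:
        if f.read(6) != MAGIC:
            raise ValueError("not a .npy file")
        major, _minor = f.read(2)
        if major != 1:
            raise ValueError("this sketch only handles format version 1.x")
        (header_len,) = struct.unpack("<H", f.read(2))
        return ast.literal_eval(f.read(header_len).decode("latin1"))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "readings.npy"
    write_npy_v1(path, [50.00, 63.09, 74.73])
    header = read_npy_header(path)

print(header)  # {'descr': '<f8', 'fortran_order': False, 'shape': (3,)}
```

The header holds only the element type (descr), the memory order, and the shape. Nothing in it can carry a column name, which is consistent with the single array column described above.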

Like the CSV and JSON readers, df is not the whole array loaded into memory. It is a lazy object that runs on ClickHouse's engine when you materialize a result by printing, calling len(), or invoking .to_pandas().

Filter and aggregate the way you already do

Use the pandas API you already know to aggregate the column:

from chdb.datastore import DataStore

df = DataStore.from_file("data/readings.npy", format="Npy")
mean_val = df["array"].mean()
print(f"mean: {float(mean_val):.2f}")

max_val = df["array"].max()
min_val = df["array"].min()
print(f"max: {float(max_val):.2f}  min: {float(min_val):.2f}")

mean: 76.93
max: 89.82  min: 50.00

The same approach works for boolean indexing after you materialize to real pandas with .to_pandas():

pdf = df.to_pandas()
good = pdf[pdf["array"] > 75]
print(f"readings above 75: {len(good)}")

readings above 75: 5

A 2-D .npy reads as one row per outer element

A .npy file can hold a matrix as well as a flat array. Save one with np.save("data/matrix.npy", np.arange(12).reshape(4, 3)) and chDB reads each outer element as a row in the array column:

mat = DataStore.from_file("data/matrix.npy", format="Npy")
print(mat)

         array
0    [0, 1, 2]
1    [3, 4, 5]
2    [6, 7, 8]
3  [9, 10, 11]

Each element in the array column is a list. Call .to_pandas() to hand it off to pandas or NumPy for further processing.
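Once the list column is in pandas, stacking it back into a 2-D NumPy array takes one call. The sketch below builds a stand-in DataFrame by hand instead of calling chDB, and assumes the cells come back as equal-length Python lists (or NumPy arrays) as in the output above.

```python
import numpy as np
import pandas as pd

# Stand-in for mat.to_pandas(): one list per row in a column named "array".
pdf = pd.DataFrame({"array": [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]})

# Stack the per-row lists back into a single (rows, cols) ndarray.
matrix = np.array(pdf["array"].tolist())

print(matrix.shape)        # (4, 3)
print(matrix.sum(axis=0))  # [18 22 26]
```

This only works when every row has the same length; ragged rows would need padding or a different layout first.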

Join two arrays by position

NumPy workflows often keep parallel arrays in separate files: readings in one .npy, a quality flag in another, aligned by position. Load each file, hand both off to real pandas, and join on the index:

readings_pdf = (
    DataStore.from_file("data/readings.npy", format="Npy")
    .to_pandas()
    .rename(columns={"array": "reading"})
)
flags_pdf = (
    DataStore.from_file("data/flags.npy", format="Npy")
    .to_pandas()
    .rename(columns={"array": "ok"})
)
combined = readings_pdf.join(flags_pdf)
print(combined)

   reading  ok
0    50.00   0
1    63.09   1
2    74.73   1
3    83.66   1
4    88.88   0
5    89.82   1
6    86.37   1
7    78.92   1

From there, filter to only the rows where ok == 1 and compute the mean in pandas:

mean_ok = combined[combined["ok"] == 1]["reading"].mean()
print(f"mean of flagged readings: {mean_ok:.2f}")
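A positional join silently fills missing rows with NaN when the two arrays differ in length, so a length check before joining can be worthwhile. The sketch below is one way to add that guard. It uses hand-built pandas frames holding the values printed above as stand-ins for the two .to_pandas() results, and join_by_position is a hypothetical helper, not a chDB or pandas function.

```python
import pandas as pd

# Stand-ins for the two .to_pandas() results, using the values printed above.
readings_pdf = pd.DataFrame(
    {"reading": [50.00, 63.09, 74.73, 83.66, 88.88, 89.82, 86.37, 78.92]}
)
flags_pdf = pd.DataFrame({"ok": [0, 1, 1, 1, 0, 1, 1, 1]})


def join_by_position(left, right):
    """Join two frames row-for-row, refusing inputs of different lengths."""
    if len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} vs {len(right)}")
    # Reset both indexes so the join is purely positional.
    return left.reset_index(drop=True).join(right.reset_index(drop=True))


combined = join_by_position(readings_pdf, flags_pdf)
mean_ok = combined.loc[combined["ok"] == 1, "reading"].mean()
print(f"mean of flagged readings: {mean_ok:.2f}")  # mean of flagged readings: 79.43
```

Raising on a mismatch makes a truncated or stale flags file fail loudly instead of producing a shorter-than-expected average.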

Hand off to real pandas when you need it

When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

df = DataStore.from_file("data/readings.npy", format="Npy")
pdf = df.to_pandas()   # a real pandas.DataFrame, in memory
print(type(pdf))

<class 'pandas.DataFrame'>

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast on large files, then .to_pandas() the small result and continue in ordinary pandas.

Works in Jupyter

Printing a chDB DataStore object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample .npy files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-npy
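If you would rather not clone the repository, sample files of the same shape can be produced with NumPy alone. This is only a sketch: the contents of generate.sh are not shown here, so the values below are copied from the printed outputs above (rounded to two decimals), the int64 dtype for the flags is an assumption, and the files are written to a temporary directory rather than data/.

```python
import tempfile
from pathlib import Path

import numpy as np

# Values copied from the printed outputs above; the real generator may differ.
readings = np.array([50.00, 63.09, 74.73, 83.66, 88.88, 89.82, 86.37, 78.92])
flags = np.array([0, 1, 1, 1, 0, 1, 1, 1], dtype=np.int64)  # assumed dtype
matrix = np.arange(12).reshape(4, 3)

out_dir = Path(tempfile.mkdtemp()) / "data"
out_dir.mkdir()
for name, arr in {"readings": readings, "flags": flags, "matrix": matrix}.items():
    np.save(out_dir / f"{name}.npy", arr)

# Round-trip check: each file holds exactly one array with its dtype and shape.
loaded = np.load(out_dir / "readings.npy")
print(loaded.dtype, loaded.shape)  # float64 (8,)
```

Point DataStore.from_file at the files in out_dir (or copy them into data/) to follow along with the examples.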

Working with other formats? "Read a JSON file in Python" and "Read a Parquet file in Python" use the same chDB DataStore pattern; only the reader changes. The same code scales from a file on your laptop to ClickHouse Cloud with no rewrite.
