How to read a .npy (NumPy) file with SQL

To read a .npy (NumPy) file from the command line, use clickhouse local. It runs SQL directly on files, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then name the Npy format and query the file:

clickhouse local -q "SELECT * FROM file('revenue.npy', Npy) LIMIT 5"

   ┌─array─┐
1. │   100 │
2. │ 107.1 │
3. │ 114.2 │
4. │ 121.3 │
5. │ 128.4 │
   └───────┘

The file is read in place with no import step. An .npy file embeds its own dtype in the header, so ClickHouse resolves the column type from the file itself and returns the array as a single column called array.

Name the format: .npy is not auto-detected

This is the one gotcha that trips people up. Most formats are inferred from the file extension, so file('data.parquet') and file('data.csv') just work. .npy is the exception. The extension is not enough; you have to pass the Npy format as the second argument, every time:

clickhouse local -q "SELECT * FROM file('revenue.npy', Npy) LIMIT 5"

Drop the Npy and ClickHouse will not know how to parse the file. Once you name it, everything else behaves like any other table.

See the dtype without declaring a schema

An .npy file carries its own dtype in the header (the >f8, <i4 and friends NumPy writes). DESCRIBE reads it back as a ClickHouse type, so you never write CREATE TABLE:

clickhouse local -q "DESCRIBE file('revenue.npy', Npy) FORMAT PrettyCompact"

   ┌─name──┬─type────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
1. │ array │ Float64 │              │                    │         │                  │                │
   └───────┴─────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

A NumPy float64 array maps to Float64, int32 to Int32, and so on. The single column is named array by default, because the file holds exactly one array and nothing else: no header row, no column names. (If you saved a 2D array, the column comes back as an Array of the element type, for example Array(Float64) for a 2D float64 array, with one nested array per row.)
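To make the header concrete, here is a minimal sketch in Python that uses only the standard library. It hand-writes a tiny version 1.0 .npy file, reads the header back, and looks the dtype up in a small table. The helper names (write_npy_v1, read_npy_header, clickhouse_type) are hypothetical, and the mapping table is an assumption drawn from the ClickHouse Npy format documentation; DESCRIBE remains the source of truth.

```python
import ast
import os
import struct
import tempfile

MAGIC = b"\x93NUMPY"

# Assumed dtype -> ClickHouse type mapping (byte-order prefix stripped).
NPY_TO_CLICKHOUSE = {
    "i1": "Int8", "i2": "Int16", "i4": "Int32", "i8": "Int64",
    "u1": "UInt8", "u2": "UInt16", "u4": "UInt32", "u8": "UInt64",
    "f4": "Float32", "f8": "Float64",
}


def write_npy_v1(path, values):
    """Write a 1-D little-endian float64 array as a version 1.0 .npy file."""
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d,), }" % len(values)
    # Pad with spaces so magic + version + length field + header ends in a
    # newline and totals a multiple of 64 bytes.
    prefix_len = len(MAGIC) + 2 + 2
    pad = -(prefix_len + len(header) + 1) % 64
    header_bytes = (header + " " * pad + "\n").encode("latin1")
    with open(path, "wb") as f:
        f.write(MAGIC + bytes([1, 0]))                 # magic string, version 1.0
        f.write(struct.pack("<H", len(header_bytes)))  # 2-byte header length
        f.write(header_bytes)
        f.write(struct.pack("<%dd" % len(values), *values))


def read_npy_header(path):
    """Return the header dict (descr, fortran_order, shape) of an .npy file."""
    with open(path, "rb") as f:
        if f.read(6) != MAGIC:
            raise ValueError("not an .npy file")
        major, _minor = f.read(2)
        # Version 1.0 stores the header length in 2 bytes; later versions use 4.
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))
        else:
            (header_len,) = struct.unpack("<I", f.read(4))
        return ast.literal_eval(f.read(header_len).decode("latin1").strip())


def clickhouse_type(descr):
    """Look up the assumed ClickHouse type for a NumPy descr such as '<f8'."""
    return NPY_TO_CLICKHOUSE.get(descr.lstrip("<>=|"), "unknown")


path = os.path.join(tempfile.mkdtemp(), "revenue.npy")
write_npy_v1(path, [100.0, 107.1, 114.2])
header = read_npy_header(path)
print(header)
print(clickhouse_type(header["descr"]))
```

The point of the sketch is that everything DESCRIBE needs (element type, byte order, shape) sits in a short text dictionary at the front of the file, ahead of the raw values.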

Rename the column and aggregate

The full SQL surface works on the array (WHERE, GROUP BY, aggregate functions, quantiles). Refer to the column as array:

clickhouse local -q "
SELECT count() AS n, round(avg(array), 2) AS mean, round(max(array), 2) AS max
FROM file('revenue.npy', Npy)
FORMAT PrettyCompact"

   ┌──n─┬───mean─┬───max─┐
1. │ 10 │ 121.35 │ 149.7 │
   └────┴────────┴───────┘

array is a clunky name to read SQL against. Give it a real one by passing an explicit structure as the third argument, the same place you would override an inferred type:

clickhouse local -q "
SELECT revenue FROM file('revenue.npy', Npy, 'revenue Float64')
WHERE revenue > 130
ORDER BY revenue DESC
FORMAT PrettyCompact"

   ┌─revenue─┐
1. │   149.7 │
2. │   142.6 │
3. │   135.5 │
   └─────────┘

The type you declare must match the array's dtype; this renames the column, it does not cast across incompatible types.

Bridge from NumPy to SQL: zip arrays into rows

Because each .npy file holds a single array, a NumPy workflow that produces several parallel arrays (a price array, a quantity array, an embedding column) lands as several files. There is no shared key inside the files; the only thing linking them is row position, the array index.

To turn parallel arrays back into rows, attach a row number to each with rowNumberInAllBlocks() and join on it:

clickhouse local -q "
SELECT r.rn AS i, r.array AS revenue, q.array AS quantity
FROM (SELECT rowNumberInAllBlocks() AS rn, array FROM file('revenue.npy', Npy)) AS r
INNER JOIN (SELECT rowNumberInAllBlocks() AS rn, array FROM file('quantity.npy', Npy)) AS q
USING rn
ORDER BY i
FORMAT PrettyCompact"

    ┌─i─┬─revenue─┬─quantity─┐
 1. │ 0 │     100 │        1 │
 2. │ 1 │   107.1 │        4 │
 3. │ 2 │   114.2 │        2 │
 4. │ 3 │   121.3 │        5 │
 5. │ 4 │   128.4 │        3 │
 6. │ 5 │   135.5 │        1 │
 7. │ 6 │   142.6 │        4 │
 8. │ 7 │   149.7 │        2 │
 9. │ 8 │   103.8 │        5 │
10. │ 9 │   110.9 │        3 │
    └───┴─────────┴──────────┘

Now the two arrays are a single table you can filter, group, and join further. This is the move whenever you have arrays saved straight out of NumPy and want them as relational rows without writing a Python merge step. NumPy and pandas handle this natively too, of course; the point of doing it in SQL is that the same query runs unchanged against the other formats sitting next to your .npy files (Parquet, CSV, Arrow) and against a server when the data grows.
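For comparison, the positional join amounts to the following in plain Python. This is only a sketch of what the query computes, not of how ClickHouse executes it. The two lists are the values from the result table above, and zip_by_position is a hypothetical helper.

```python
revenue = [100.0, 107.1, 114.2, 121.3, 128.4, 135.5, 142.6, 149.7, 103.8, 110.9]
quantity = [1, 4, 2, 5, 3, 1, 4, 2, 5, 3]


def zip_by_position(**columns):
    """Pair parallel arrays by index, like joining on a row number.

    An INNER JOIN on the row number keeps only indexes present in every
    input, so the result is as long as the shortest array.
    """
    names = list(columns)
    n = min(len(values) for values in columns.values())
    return [
        {"i": i, **{name: columns[name][i] for name in names}}
        for i in range(n)
    ]


rows = zip_by_position(revenue=revenue, quantity=quantity)
for row in rows[:3]:
    print(row)
```

Because position is the only link between the files, arrays of different lengths are worth checking before you trust the joined rows: the inner join silently drops the unmatched tail.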

How fast is it on a real file?

The .npy layout is a contiguous block of fixed-width numbers, so reading it is close to a raw mmap. On a 3,000,000-value Float64 array (scores_large.npy, ~23 MB, generated by the example folder below), computing the count, mean, and 95th percentile across every value runs in:

clickhouse local -q "
SELECT count(), round(avg(array), 3), round(quantile(0.95)(array), 3)
FROM file('scores_large.npy', Npy)"

   ┌─count()─┬─round(avg(array), 3)─┬─round(quanti⋯(array), 3)─┐
1. │ 3000000 │              499.929 │                  950.141 │
   └─────────┴──────────────────────┴──────────────────────────┘

~0.17 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). The number is honest but small: 3M doubles is a modest array, and the read is essentially memory-bandwidth bound.
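The ~23 MB figure quoted for scores_large.npy follows directly from the layout. A quick arithmetic check in Python, assuming 8 bytes per Float64 value and ignoring the small header:

```python
values = 3_000_000
bytes_per_value = 8  # a Float64 (NumPy f8) is 8 bytes wide

payload_bytes = values * bytes_per_value
payload_mib = payload_bytes / (1024 * 1024)

print(payload_bytes)          # 24000000
print(round(payload_mib, 1))  # 22.9
```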

The same SQL scales unchanged

The query you just ran on a local array is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. You swap file('scores_large.npy', Npy) for a table name and nothing else changes. You prototype against arrays on your laptop and ship the identical logic to production, with no separate "local dialect" to unlearn.

Run it yourself

The complete, runnable example lives here. It has generate.sh (writes the two small arrays and the ~23 MB perf array), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-npy

git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-npy
./generate.sh && ./run.sh

Prefer Python? → Read a .npy file in Python with chDB
